vllm-project · linyueqian · May 5, 2026
@@ -211,8 +211,8 @@ See `plan/voxcpm2_native_ar_design.md`.
 
 - Model files in `vllm_omni/model_executor/models/<model_name>/`
 - Stage config YAML
-- Working `end2end.py` with correct audio output
-- README.md in the example directory
+- Working `end2end.py` at `examples/offline_inference/text_to_speech/<model>/end2end.py`
+- New section in `examples/offline_inference/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/offline_inference/<model>/` dir or a per-model `README.md` inside `text_to_speech/<model>/` — the hub README is the documented surface and the mkdocs `generate_examples` hook only descends one level into `examples/<category>/`.
 
 ## Phase 3: Online Serving
 
@@ -308,11 +308,11 @@ def build_voice_clone_prompt(ref_audio_path: str, text: str, codec) -> list:
 ### Deliverables
 
 - Updated `serving_speech.py` with all 5 integration points (single commit)
-- Client scripts and server launcher
-- Gradio demo with streaming and voice cloning UI
+- Client scripts and server launcher under `examples/online_serving/text_to_speech/<model>/`
+- Gradio demo with streaming and voice cloning UI in the same directory
 - E2E online serving test (`tests/e2e/online_serving/test_<model>.py`)
 - Buildkite CI entry in `.buildkite/test-merge.yml`
-- Documentation (offline + online serving docs)
+- New section in `examples/online_serving/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/online_serving/<model>/` dir or a per-model `README.md` inside `text_to_speech/<model>/`.
 
 ### E2E test pitfalls to avoid
 

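The Phase 2 and Phase 3 deliverables above can be collected into a small checklist helper. This is a minimal sketch, not part of the PR: the function name `deliverable_paths` and the `create` / `edit` grouping are hypothetical, and only paths that the deliverables spell out in full are included (the location of `serving_speech.py` and the stage config YAML name are left out because the hunks do not state them).

```python
from pathlib import PurePosixPath


def deliverable_paths(model: str) -> dict[str, list[str]]:
    """List the repo-relative paths named in the deliverables for one model.

    Hypothetical helper: it only formats the path templates quoted in the
    deliverable lists and does not touch the filesystem.
    """
    offline_hub = PurePosixPath("examples/offline_inference/text_to_speech")
    online_hub = PurePosixPath("examples/online_serving/text_to_speech")
    return {
        # New files or directories the contributor adds.
        "create": [
            f"vllm_omni/model_executor/models/{model}/",
            str(offline_hub / model / "end2end.py"),
            str(online_hub / model) + "/",
            f"tests/e2e/online_serving/test_{model}.py",
        ],
        # Existing files that gain a new section or entry.
        "edit": [
            str(offline_hub / "README.md"),
            str(online_hub / "README.md"),
            ".buildkite/test-merge.yml",
        ],
    }


if __name__ == "__main__":
    for action, paths in deliverable_paths("my_model").items():
        for path in paths:
            print(f"{action:6} {path}")
```

Running it with a placeholder name such as `my_model` prints one line per path, which can be compared against the file list of a pull request.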
@@ -186,6 +186,17 @@ vllm_omni/model_executor/stage_configs/
   your_model_name_async_chunk.yaml   # Streaming mode config
 ```
 
+### Example placement
+
+TTS examples live in the consolidated text-to-speech hub, **not** in their
+own top-level directory. Place per-model scripts under
+`examples/offline_inference/text_to_speech/<your_model>/` and
+`examples/online_serving/text_to_speech/<your_model>/`, and add a section
+to the hub `README.md` files (table row + per-model section) instead of a
+new per-model `README.md`. The mkdocs `generate_examples` hook treats the
+`text_to_speech/` parent as a single example, so per-model READMEs inside
+it would not be picked up — the hub README is the documented surface.
+
 **Qwen3-TTS reference files:**
 
 | File | Purpose |

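The placement rule in the "Example placement" section above can be expressed as a path check. The following is a rough sketch under stated assumptions: `check_tts_example_path` and its return strings are hypothetical names chosen for illustration, and the check encodes only the two prohibitions the section states (no top-level per-model directory, no per-model `README.md` inside the hub). It does not model the mkdocs `generate_examples` hook itself.

```python
from pathlib import PurePosixPath

# The two consolidated hubs named in the placement rule.
HUBS = {
    ("examples", "offline_inference", "text_to_speech"),
    ("examples", "online_serving", "text_to_speech"),
}
CATEGORIES = {"offline_inference", "online_serving"}


def check_tts_example_path(path: str, model: str) -> str:
    """Classify a repo-relative path against the TTS example placement rule.

    Hypothetical helper: the labels it returns are illustrative only.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[0] != "examples" or parts[1] not in CATEGORIES:
        return "not an example path"
    # examples/<category>/<model>/... is the layout the rule forbids.
    if parts[2] == model:
        return "misplaced: top-level model directory"
    if parts[:3] in HUBS:
        rest = parts[3:]
        if rest == ("README.md",):
            return "ok: hub README"
        if rest[:1] == (model,):
            # A README inside the per-model directory is also forbidden.
            if rest[1:] == ("README.md",):
                return "misplaced: per-model README"
            return "ok: per-model file"
    return "unrelated"


if __name__ == "__main__":
    for candidate in (
        "examples/offline_inference/text_to_speech/my_model/end2end.py",
        "examples/offline_inference/my_model/end2end.py",
        "examples/online_serving/text_to_speech/my_model/README.md",
        "examples/online_serving/text_to_speech/README.md",
    ):
        print(candidate, "->", check_tts_example_path(candidate, "my_model"))
```

A check of this kind could run over the changed files of a pull request, though whether the project wants such a guard is left open here.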
@@ -9,7 +9,7 @@ vLLM-Omni supports two deployment modes:
 | Thinker + Talker (omni-speech, default) | `vllm_omni/deploy/ming_flash_omni.yaml` | Text + Audio |
 | Thinker only (multimodal understanding) | `vllm_omni/deploy/ming_flash_omni_thinker_only.yaml` | Text |
 
-For standalone TTS (talker only), see [`examples/offline_inference/ming_flash_omni_tts/`](../ming_flash_omni_tts/).
+For standalone TTS (talker only), see the [Ming-flash-omni-TTS section in the Text-To-Speech hub](../text_to_speech/README.md#ming-flash-omni-tts).
 
 ## Setup
 

@@ -16,9 +16,11 @@ list of supported architectures across all modalities, see
 |---|---|---|---|---|---|---|
 | CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | 2 (talker + code2wav) | ✓ | ✓ | — | 22.05 kHz |
 | Fish Speech S2 Pro | `fishaudio/s2-pro` | dual-AR | ✓ | ✓ | — | 44.1 kHz |
+| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | single (talker only) | — (caption-controlled) | — | style / IP / basic captions | 44.1 kHz |
+| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | single (AR + codec) | ✓ (required) | ✓ | voice_clone, continuation | 48 kHz |
 | OmniVoice | `k2-fsa/OmniVoice` | 2 (gen + dec) | ✓ | — | voice design, language hint | 24 kHz |
 | Qwen3-TTS | `Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}` | 2 (talker + code2wav) | ✓ (Base) | ✓ | 3 task variants | 24 kHz |
-| VoxCPM | local model dir | split | ✓ | ✓ | — | 24 kHz |
+| VoxCPM | `openbmb/VoxCPM-0.5B` | split | ✓ | ✓ | — | 24 kHz |
 | VoxCPM2 | `openbmb/VoxCPM2` | single (native AR) | ✓ | ✓ (online) | continuation | 48 kHz |
 | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | varies | ✓ | ✓ | voice presets | 24 kHz |
 
@@ -126,6 +128,76 @@ Streaming requires `async_chunk: true` in the stage config.
 
 ---
 
+## Ming-flash-omni-TTS
+
+Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields rather than reference audio: `风格` (style), `IP`, and `语速` / `基频` / `音量` (speed / pitch / volume).
+
+### Prerequisites
+The example calls into `vllm_omni.model_executor.models.ming_flash_omni.prompt_utils` for the default prompt and instruction builder; no extra pip install is needed beyond the base vLLM-Omni install.
+
+### Quick start
+```bash
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style
+```
+
+### Cases
+```bash
+# ASMR-style whisper (caption-driven)
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style
+
+# IP voice (preset character voice via caption)
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case ip
+
+# Basic speed/pitch/volume control
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case basic
+```
+
+Override the default text per case with `--text` and write to a custom path with `--output`.
+
+### Notes
+- Talker-only deployment — for the multimodal Ming-flash-omni example, see [`examples/offline_inference/ming_flash_omni/`](../../ming_flash_omni/).
+- Deploy config: `vllm_omni/deploy/ming_flash_omni_tts.yaml` (single GPU, `enforce_eager`, `max_num_seqs: 1`).
+- Decode defaults from the Ming cookbook: `max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`, `use_zero_spk_emb=True`.
+
+---
+
+## MOSS-TTS-Nano
+
+Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono (mixed down from upstream stereo). Supported languages: ZH / EN / JA. Every request requires a reference clip via `--ref-audio`.
+
+> **No built-in speaker presets.** `--ref-audio` is required on every call. Default `--mode voice_clone` matches upstream's recommended workflow; `--mode continuation` is exposed for completeness but upstream's continuation-with-prompt path emits very short / near-silent output, so it is rarely useful in practice. Sample reference clips ship in the upstream repo under [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio) (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`).
+
+### Quick start
+```bash
+# Fetch a sample reference clip (one-off, user-scoped cache).
+REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
+mkdir -p "$REF_DIR"
+[ -s "$REF_DIR/zh_1.wav" ] || \
+    curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
+
+python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
+    --text "你好，这是MOSS-TTS-Nano的语音合成演示。" \
+    --ref-audio "$REF_DIR/zh_1.wav"
+```
+The first run downloads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` from Hugging Face.
+
+### Reproducible runs
+```bash
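+# Assumes REF_DIR is set as in the quick start and en_2.wav was fetched into it the same way as zh_1.wav.
+python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \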
+    --text "Deterministic test." \
+    --ref-audio "$REF_DIR/en_2.wav" \
+    --seed 42
+```
+
+### Notes
+- Output: 48 kHz mono WAV (the tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
+- Deploy config: `vllm_omni/deploy/moss_tts_nano.yaml` (auto-loaded; override with `--deploy-config`).
+- Default `--max-new-frames 375` ≈ 14 s of audio; raise for longer outputs.
+- `--ref-text` is rejected in `voice_clone` mode and required only with `--mode continuation`.
+- Run the script with `--help` for the full sampling-knob surface (`--audio-temperature`, `--audio-top-k`, `--audio-top-p`, `--text-temperature`).
+
+---
+
 ## OmniVoice
 
 Zero-shot multilingual TTS supporting 600+ languages, with three modes (auto / clone / design).