diff --git a/.claude/skills/add-tts-model/SKILL.md b/.claude/skills/add-tts-model/SKILL.md
index 963ffb4f64d..811a359d0e4 100644
--- a/.claude/skills/add-tts-model/SKILL.md
+++ b/.claude/skills/add-tts-model/SKILL.md
@@ -211,8 +211,8 @@ See `plan/voxcpm2_native_ar_design.md`.
 
 - Model files in `vllm_omni/model_executor/models/<model_name>/`
 - Stage config YAML
-- Working `end2end.py` with correct audio output
-- README.md in the example directory
+- Working `end2end.py` at `examples/offline_inference/text_to_speech/<model>/end2end.py`
+- New section in `examples/offline_inference/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/offline_inference/<model>/` dir or a per-model `README.md` inside `text_to_speech/<model>/` — the hub README is the documented surface and the mkdocs `generate_examples` hook only descends one level into `examples/<category>/`.
 
 ## Phase 3: Online Serving
 
@@ -308,11 +308,11 @@ def build_voice_clone_prompt(ref_audio_path: str, text: str, codec) -> list:
 ### Deliverables
 
 - Updated `serving_speech.py` with all 5 integration points (single commit)
-- Client scripts and server launcher
-- Gradio demo with streaming and voice cloning UI
+- Client scripts and server launcher under `examples/online_serving/text_to_speech/<model>/`
+- Gradio demo with streaming and voice cloning UI in the same dir
 - E2E online serving test (`tests/e2e/online_serving/test_<model>.py`)
 - Buildkite CI entry in `.buildkite/test-merge.yml`
-- Documentation (offline + online serving docs)
+- New section in `examples/online_serving/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/online_serving/<model>/` dir or a per-model `README.md` inside `text_to_speech/<model>/`.
 
 ### E2E test pitfalls to avoid
 
diff --git a/docs/contributing/model/adding_tts_model.md b/docs/contributing/model/adding_tts_model.md
index 34fd2dbb503..3e5ae30df6d 100644
--- a/docs/contributing/model/adding_tts_model.md
+++ b/docs/contributing/model/adding_tts_model.md
@@ -186,6 +186,17 @@ vllm_omni/model_executor/stage_configs/
   your_model_name_async_chunk.yaml   # Streaming mode config
 ```
 
+### Example placement
+
+TTS examples live in the consolidated text-to-speech hub, **not** in their
+own top-level directory. Place per-model scripts under
+`examples/offline_inference/text_to_speech/<your_model>/` and
+`examples/online_serving/text_to_speech/<your_model>/`, and add a section
+to the hub `README.md` files (table row + per-model section) instead of a
+new per-model `README.md`. The mkdocs `generate_examples` hook treats the
+`text_to_speech/` parent as a single example, so per-model READMEs inside
+it would not be picked up — the hub README is the documented surface.
+
 **Qwen3-TTS reference files:**
 
 | File | Purpose |
diff --git a/docs/user_guide/examples/offline_inference/voxtral_tts.md b/docs/user_guide/examples/offline_inference/voxtral_tts.md
deleted file mode 100644
index c6f41ac0875..00000000000
--- a/docs/user_guide/examples/offline_inference/voxtral_tts.md
+++ /dev/null
@@ -1,68 +0,0 @@
-# Voxtral TTS Offline Inference
-
-Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts>.
-
-
-`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file.
-
-When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction.
-
-## Usage Examples
-
-
-```bash
-# Basic single-prompt with cheerful_female voice preset
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --write-audio --voice cheerful_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# 32 replicate prompts with cheerful_female voice preset
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --num-prompts 32 --write-audio --voice cheerful_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# Streaming with neutral_female voice preset
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --streaming --write-audio --voice neutral_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# Short debug prompt with reference audio
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --write-audio \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "This is a test message." \
-    --audio-path path/to/reference_audio.wav
-```
-
-## Arguments
-
-| Argument | Description |
-|---|---|
-| `--model PATH` | HuggingFace repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) |
-| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) |
-| `--audio-path PATH` | Path to reference audio file for voice cloning |
-| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) |
-| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. |
-| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) |
-| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) |
-| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) |
-| `--voice NAME` | Voice preset to use instead of reference audio file (e.g., casual_female, casual_male, cheerful_female, neutral_female, neutral_male) |
-| `--write-audio` | Write generated audio to WAV files |
-| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) |
-| `--log-stats` | Enable detailed statistics logging |
-
-## Example materials
-
-??? abstract "end2end.py"
-    ``````py
-    --8<-- "examples/offline_inference/voxtral_tts/end2end.py"
-    ``````
diff --git a/examples/offline_inference/ming_flash_omni/README.md b/examples/offline_inference/ming_flash_omni/README.md
index 179925bb68e..4b798619e2f 100644
--- a/examples/offline_inference/ming_flash_omni/README.md
+++ b/examples/offline_inference/ming_flash_omni/README.md
@@ -9,7 +9,7 @@ vLLM-Omni supports two deployment modes:
 | Thinker + Talker (omni-speech, default) | `vllm_omni/deploy/ming_flash_omni.yaml` | Text + Audio |
 | Thinker only (multimodal understanding) | `vllm_omni/deploy/ming_flash_omni_thinker_only.yaml` | Text |
 
-For standalone TTS (talker only), see [`examples/offline_inference/ming_flash_omni_tts/`](../ming_flash_omni_tts/).
+For standalone TTS (talker only), see the [Ming-flash-omni-TTS section in the Text-To-Speech hub](../text_to_speech/README.md#ming-flash-omni-tts).
 
 ## Setup
 
diff --git a/examples/offline_inference/ming_flash_omni_tts/README.md b/examples/offline_inference/ming_flash_omni_tts/README.md
deleted file mode 100644
index d0ad9b30d2f..00000000000
--- a/examples/offline_inference/ming_flash_omni_tts/README.md
+++ /dev/null
@@ -1,47 +0,0 @@
-# Ming-flash-omni Standalone TTS (Offline)
-
-This example runs **Ming-flash-omni-2.0 talker-only** offline inference with:
-
-- `model`: `Jonathan1909/Ming-flash-omni-2.0`
-- `deploy config`: `vllm_omni/deploy/ming_flash_omni_tts.yaml`
-
-It follows the Ming cookbook parameter style:
-
-- `prompt`: `"Please generate speech based on the following description.\n"`
-- `max_decode_steps`: `200`
-- `cfg`: `2.0`
-- `sigma`: `0.25`
-- `temperature`: `0.0`
-
-## Quick Start
-
-```bash
-python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style
-```
-
-## Cases
-
-```bash
-# Style
-python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style
-
-# IP
-python examples/offline_inference/ming_flash_omni_tts/end2end.py --case ip
-
-# Basic (speed/pitch/volume control)
-python examples/offline_inference/ming_flash_omni_tts/end2end.py --case basic
-```
-
-## Useful Arguments
-
-- `--text`: override default text in the selected case
-- `--output`: custom output wav path
-- `--model`: local model path or HF repo id
-- `--deploy-config`: custom talker deploy YAML path
-- `--log-stats`: enable runtime stats logs
-
-## Notes
-
-- This directory is for **standalone talker deployment (TTS)**.
-- For Ming thinker multimodal understanding examples, see:
-  `examples/offline_inference/ming_flash_omni/`.
diff --git a/examples/offline_inference/moss_tts_nano/README.md b/examples/offline_inference/moss_tts_nano/README.md
deleted file mode 100644
index d2a7051400b..00000000000
--- a/examples/offline_inference/moss_tts_nano/README.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# MOSS-TTS-Nano Offline Inference
-
-## Overview
-
-Single-stage offline TTS pipeline using the 0.1B MOSS-TTS-Nano AR LM and MOSS-Audio-Tokenizer-Nano codec. Outputs 48 kHz mono WAV (the upstream tokenizer is stereo at 48 kHz; the wrapper mixes down to mono so it lines up with the rest of the engine's single-channel audio path).
-
-> **No built-in speaker presets.** Every request needs `--prompt-audio`
-> (a reference clip). The default `--mode voice_clone` is upstream's
-> recommended workflow and is the only mode the OpenAI server exposes;
-> the offline CLI also exposes `--mode continuation` for completeness,
-> but note that upstream's continuation-with-prompt path emits very
-> short / near-silent output, so it is rarely useful in practice. See
-> upstream's `infer.py` for the full surface.
->
-> Sample reference clips ship in the upstream repo under
-> [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio)
-> (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`).
-
-## Quick Start
-
-```bash
-# Fetch a sample reference clip from upstream (one-off, user-scoped cache).
-REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
-mkdir -p "$REF_DIR"
-[ -s "$REF_DIR/zh_1.wav" ] || \
-    curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
-
-python end2end.py \
-    --text "你好，这是MOSS-TTS-Nano的语音合成演示。" \
-    --prompt-audio "$REF_DIR/zh_1.wav"
-```
-
-The first run downloads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` from Hugging Face.
-
-## Usage
-
-```
-python end2end.py [OPTIONS]
-
-Required:
-  --prompt-audio PATH       Reference WAV/MP3 for voice cloning / continuation
-
-Options:
-  --text TEXT               Text to synthesize (default: "Hello, this is MOSS-TTS-Nano speaking.")
-  --prompt-text TEXT        Optional. Required only with --mode continuation;
-                            rejected by upstream in --mode voice_clone.
-  --mode MODE               voice_clone (default) or continuation
-  --max-new-frames N        Max AR frames, default 375 (~14 s audio)
-  --seed INT                Random seed for reproducibility
-  --audio-temperature F     Audio sampling temperature (default: 0.8)
-  --audio-top-k N           Audio top-k sampling (default: 25)
-  --audio-top-p F           Audio top-p sampling (default: 0.95)
-  --text-temperature F      Text layer temperature (default: 1.0)
-  --output-dir DIR          Directory for WAV outputs (default: $XDG_CACHE_HOME/moss_tts_nano_output, falls back to ~/.cache/...)
-  --deploy-config PATH      Override deploy YAML (defaults to vllm_omni/deploy/moss_tts_nano.yaml)
-  --stage-init-timeout INT  Timeout in seconds for stage init (default: 120)
-```
-
-## Examples
-
-```bash
-REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
-
-# Chinese reference clip → Chinese synthesis (voice_clone, default)
-python end2end.py \
-    --text "你好，这是 MOSS-TTS-Nano 的语音合成测试。" \
-    --prompt-audio "$REF_DIR/zh_1.wav"
-
-# Reproducible output
-python end2end.py \
-    --text "Deterministic test." \
-    --prompt-audio "$REF_DIR/en_2.wav" \
-    --seed 42
-```
-
-## Deploy Config
-
-Runtime knobs live in `vllm_omni/deploy/moss_tts_nano.yaml` (auto-loaded;
-override with `--deploy-config PATH`). Key stage-level settings:
-
-```yaml
-stages:
-  - stage_id: 0
-    gpu_memory_utilization: 0.3   # ~2 GB VRAM; increase for faster init
-    max_num_seqs: 4               # concurrent requests
-    max_model_len: 4096
-```
-
-## Output Format
-
-WAV files, 48 kHz, mono. The MOSS audio tokenizer is internally stereo (2-channel) at 48 kHz; the wrapper averages the two channels into mono before reaching the engine, so playback duration / pitch are correct against the WAV header's 48 kHz rate.
-
-## Troubleshooting
-
-- **`libnvrtc.so.13: cannot open shared object file`**: torchaudio 2.10+ torchcodec backend requires NVRTC. The model patches `torchaudio.load/save` automatically at load time to fall back to soundfile.
-- **`flash_attn not installed`**: The model falls back to `sdpa` attention automatically.
-- **Empty audio**: Check that `--text` is non-empty and the model loaded successfully (look for "MOSS-TTS-Nano LM loaded" in logs).
diff --git a/examples/offline_inference/text_to_speech/README.md b/examples/offline_inference/text_to_speech/README.md
index a457c6c0a91..ddc5f11c16b 100644
--- a/examples/offline_inference/text_to_speech/README.md
+++ b/examples/offline_inference/text_to_speech/README.md
@@ -16,9 +16,11 @@ list of supported architectures across all modalities, see
 |---|---|---|---|---|---|---|
 | CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | 2 (talker + code2wav) | ✓ | ✓ | — | 22.05 kHz |
 | Fish Speech S2 Pro | `fishaudio/s2-pro` | dual-AR | ✓ | ✓ | — | 44.1 kHz |
+| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | single (talker only) | — (caption-controlled) | — | style / IP / basic captions | 44.1 kHz |
+| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | single (AR + codec) | ✓ (required) | ✓ | voice_clone, continuation | 48 kHz |
 | OmniVoice | `k2-fsa/OmniVoice` | 2 (gen + dec) | ✓ | — | voice design, language hint | 24 kHz |
 | Qwen3-TTS | `Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}` | 2 (talker + code2wav) | ✓ (Base) | ✓ | 3 task variants | 24 kHz |
-| VoxCPM | local model dir | split | ✓ | ✓ | — | 24 kHz |
+| VoxCPM | `openbmb/VoxCPM-0.5B` | split | ✓ | ✓ | — | 24 kHz |
 | VoxCPM2 | `openbmb/VoxCPM2` | single (native AR) | ✓ | ✓ (online) | continuation | 48 kHz |
 | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | varies | ✓ | ✓ | voice presets | 24 kHz |
 
@@ -126,6 +128,76 @@ Streaming requires `async_chunk: true` in the stage config.
 
 ---
 
+## Ming-flash-omni-TTS
+
+Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields (`风格` / `IP` / `语速`/`基频`/`音量`) rather than reference audio.
+
+### Prerequisites
+The example calls into `vllm_omni.model_executor.models.ming_flash_omni.prompt_utils` for the default prompt and instruction builder; no extra pip install is needed on top of the base vLLM-Omni install.
+
+### Quick start
+```bash
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style
+```
+
+### Cases
+```bash
+# ASMR-style whisper (caption-driven)
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style
+
+# IP voice (preset character voice via caption)
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case ip
+
+# Basic speed/pitch/volume control
+python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case basic
+```
+
+Override the default text per case with `--text`, and write to a custom path with `--output`.
+
+### Notes
+- Talker-only deployment — for the multimodal Ming-flash-omni example, see [`examples/offline_inference/ming_flash_omni/`](../../ming_flash_omni/).
+- Deploy config: `vllm_omni/deploy/ming_flash_omni_tts.yaml` (single GPU, `enforce_eager`, `max_num_seqs: 1`).
+- Decode defaults from the Ming cookbook: `max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`, `use_zero_spk_emb=True`.
+
+---
+
+## MOSS-TTS-Nano
+
+Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono (mixed down from upstream stereo). Supports ZH / EN / JA. Every request requires a reference clip via `--ref-audio`.
+
+> **No built-in speaker presets.** `--ref-audio` is required on every call. Default `--mode voice_clone` matches upstream's recommended workflow; `--mode continuation` is exposed for completeness but upstream's continuation-with-prompt path emits very short / near-silent output, so it is rarely useful in practice. Sample reference clips ship in the upstream repo under [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio) (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`).
+
+### Quick start
+```bash
+# Fetch a sample reference clip (one-off, user-scoped cache).
+REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
+mkdir -p "$REF_DIR"
+[ -s "$REF_DIR/zh_1.wav" ] || \
+    curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
+
+python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
+    --text "你好，这是MOSS-TTS-Nano的语音合成演示。" \
+    --ref-audio "$REF_DIR/zh_1.wav"
+```
+The first run downloads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` from Hugging Face.
+
+### Reproducible runs
+```bash
+python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
+    --text "Deterministic test." \
+    --ref-audio "$REF_DIR/en_2.wav" \
+    --seed 42
+```
+
+### Notes
+- Output: 48 kHz mono WAV (the tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
+- Deploy config: `vllm_omni/deploy/moss_tts_nano.yaml` (auto-loaded; override with `--deploy-config`).
+- Default `--max-new-frames 375` ≈ 14 s of audio; raise for longer outputs.
+- `--ref-text` is rejected in `voice_clone` mode and required only with `--mode continuation`.
+- Run the script with `--help` for the full set of sampling knobs (`--audio-temperature`, `--audio-top-k`, `--audio-top-p`, `--text-temperature`).
+
+---
+
 ## OmniVoice
 
 Zero-shot multilingual TTS supporting 600+ languages, with three modes (auto / clone / design).
diff --git a/examples/offline_inference/ming_flash_omni_tts/end2end.py b/examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py
similarity index 100%
rename from examples/offline_inference/ming_flash_omni_tts/end2end.py
rename to examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py
diff --git a/examples/offline_inference/moss_tts_nano/end2end.py b/examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py
similarity index 92%
rename from examples/offline_inference/moss_tts_nano/end2end.py
rename to examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py
index 1d2b21043be..3a27ad04dea 100644
--- a/examples/offline_inference/moss_tts_nano/end2end.py
+++ b/examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py
@@ -9,20 +9,20 @@
 
 MOSS-TTS-Nano upstream supports two modes (matching ``infer.py``):
 
-* ``voice_clone`` (recommended): only ``--prompt-audio`` is required.
-* ``continuation``: ``--prompt-audio`` + ``--prompt-text`` together.
+* ``voice_clone`` (recommended): only ``--ref-audio`` is required.
+* ``continuation``: ``--ref-audio`` + ``--ref-text`` together.
 
 Usage:
   # Voice clone (recommended): ref audio only, no transcript needed.
   python end2end.py \\
     --text "Hello!" \\
-    --prompt-audio /path/to/ref.wav
+    --ref-audio /path/to/ref.wav
 
   # Continuation: ref audio + its transcript.
   python end2end.py \\
     --text "Hello!" \\
-    --prompt-audio /path/to/ref.wav \\
-    --prompt-text "Transcript of the reference clip." \\
+    --ref-audio /path/to/ref.wav \\
+    --ref-text "Transcript of the reference clip." \\
     --mode continuation
 
   # Sample reference clips ship in the upstream repo:
@@ -120,11 +120,11 @@ def main(args) -> None:
     output_dir.mkdir(parents=True, exist_ok=True)
 
     print(f"Synthesizing: {args.text!r}")
-    print(f"  ref_audio: {args.prompt_audio}")
+    print(f"  ref_audio: {args.ref_audio}")
     inputs = build_request(
         text=args.text,
-        prompt_audio_path=args.prompt_audio,
-        prompt_text=args.prompt_text,
+        prompt_audio_path=args.ref_audio,
+        prompt_text=args.ref_text,
         mode=args.mode,
         max_new_frames=args.max_new_frames,
         seed=args.seed,
@@ -158,15 +158,15 @@ def parse_args():
     parser = FlexibleArgumentParser(description="MOSS-TTS-Nano offline inference")
     parser.add_argument("--text", default="Hello, this is MOSS-TTS-Nano speaking.", help="Text to synthesize.")
     parser.add_argument(
-        "--prompt-audio",
+        "--ref-audio",
         required=True,
         help="Path to reference audio for voice cloning / continuation (required).",
     )
     parser.add_argument(
-        "--prompt-text",
+        "--ref-text",
         default=None,
         help=(
-            "Optional transcript of --prompt-audio. Required (and only meaningful) "
+            "Optional transcript of --ref-audio. Required (and only meaningful) "
             "in --mode continuation; rejected by upstream in --mode voice_clone."
         ),
     )
diff --git a/examples/offline_inference/voxtral_tts/README.md b/examples/offline_inference/voxtral_tts/README.md
deleted file mode 100644
index bbe317798a8..00000000000
--- a/examples/offline_inference/voxtral_tts/README.md
+++ /dev/null
@@ -1,58 +0,0 @@
-# Voxtral TTS Offline Inference
-
-`end2end.py` runs Voxtral TTS end-to-end offline inference using vLLM. It supports both blocking (`Omni`) and streaming (`AsyncOmni`) generation, batched prompts with configurable concurrency, and voice selection via preset name or reference audio file.
-
-When `mistral_common` has `SpeechRequest` support, prompt token IDs are built via `encode_speech_request`. Otherwise, the script falls back to manual token construction.
-
-## Usage Examples
-
-
-```bash
-# Basic single-prompt with cheerful_female voice preset
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --write-audio --voice cheerful_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# 32 replicate prompts with cheerful_female voice preset
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --num-prompts 32 --write-audio --voice cheerful_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# Streaming with neutral_female voice preset
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --streaming --write-audio --voice neutral_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# 32 prompts, 8 concurrent requests per wave, streaming with neutral_female voice
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --num-prompts 32 --concurrency 8 --streaming --write-audio --voice neutral_female \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "That eerie silence after the first storm was just the calm before another round of chaos, wasn't it?"
-
-# Short debug prompt with reference audio
-python3 examples/offline_inference/voxtral_tts/end2end.py \
-    --write-audio \
-    --model mistralai/Voxtral-4B-TTS-2603 \
-    --text "This is a test message." \
-    --audio-path path/to/reference_audio.wav
-```
-
-## Arguments
-
-| Argument | Description |
-|---|---|
-| `--model PATH` | HuggingFace repo ID or local directory path (default: `mistralai/Voxtral-4B-TTS-2603`) |
-| `--text TEXT` | Text to synthesize (default: `"This is a test message."`) |
-| `--audio-path PATH` | Path to reference audio file for voice cloning |
-| `--output-dir DIR` | Directory to write output WAV files (default: `output_audio`) |
-| `--deploy-config PATH` | Override the deploy config path. If unset, auto-loads `vllm_omni/deploy/voxtral_tts.yaml` from the HF `model_type`. |
-| `--num-prompts N` | Number of replicate prompts to run for measuring performance (default: 1) |
-| `--streaming` | Use streaming generation via `AsyncOmni` (default: blocking `Omni`) |
-| `--concurrency N` | Max concurrent requests per wave (must be used with `--streaming`, must evenly divide `--num-prompts`) |
-| `--voice NAME` | Voice preset to use instead of reference audio file. Check Huggingface `mistralai/Voxtral-4B-TTS-2603` to get the list of available voices |
-| `--write-audio` | Write generated audio to WAV files |
-| `--profiling-mode` | Enable profiling mode (reduces max tokens to 50) |
-| `--log-stats` | Enable detailed statistics logging |
diff --git a/examples/online_serving/ming_flash_omni/README.md b/examples/online_serving/ming_flash_omni/README.md
index dd8c8aa8186..533a30e1cb0 100644
--- a/examples/online_serving/ming_flash_omni/README.md
+++ b/examples/online_serving/ming_flash_omni/README.md
@@ -11,7 +11,7 @@ Please refer to [README.md](../../../README.md)
 | Thinker + Talker (omni-speech, default) | `vllm serve ... --omni` | Text + Audio |
 | Thinker only (multimodal understanding) | `vllm serve ... --omni --deploy-config vllm_omni/deploy/ming_flash_omni_thinker_only.yaml` | Text |
 
-For standalone TTS (talker only), see [`examples/online_serving/ming_flash_omni_tts/`](../ming_flash_omni_tts/).
+For standalone TTS (talker only), see the [Ming-flash-omni-TTS section in the Text-To-Speech hub](../text_to_speech/README.md#ming-flash-omni-tts).
 
 ## Run examples (Ming-flash-omni 2.0)
 
diff --git a/examples/online_serving/ming_flash_omni_tts/README.md b/examples/online_serving/ming_flash_omni_tts/README.md
deleted file mode 100644
index 1b372e3897e..00000000000
--- a/examples/online_serving/ming_flash_omni_tts/README.md
+++ /dev/null
@@ -1,54 +0,0 @@
-# Ming-flash-omni Standalone TTS (Online Serving)
-
-This directory contains online e2e examples for **Ming-flash-omni-2.0 standalone talker deployment**.
-
-Server uses:
-
-- `model`: `Jonathan1909/Ming-flash-omni-2.0`
-- `deploy config`: `vllm_omni/deploy/ming_flash_omni_tts.yaml`
-
-## Launch the Server
-
-```bash
-# from repo root
-bash examples/online_serving/ming_flash_omni_tts/run_server.sh
-```
-
-Equivalent manual command:
-
-```bash
-vllm serve Jonathan1909/Ming-flash-omni-2.0 \
-    --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
-    --host 0.0.0.0 \
-    --port 8091 \
-    --trust-remote-code \
-    --omni
-```
-
-## Send TTS Request
-
-Python client:
-
-```bash
-python examples/online_serving/ming_flash_omni_tts/speech_client.py \
-    --text "我们当迎着阳光辛勤耕作，去摘取，去制作，去品尝，去馈赠。" \
-    --output ming_online.wav
-```
-
-Long-form `instructions` (e.g. ASMR whisper style) via the client:
-
-```bash
-python examples/online_serving/ming_flash_omni_tts/speech_client.py \
-    --text "我会一直在这里陪着你，直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗？" \
-    --instructions "这是一种ASMR耳语，属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语，声音气音成分重。音量极低，紧贴麦克风，语速极慢，旨在制造触发听者颅内快感的声学刺激。" \
-    --output ming_online_asmr.wav
-```
-
-## Notes
-
-- This is the **online serving** counterpart of `examples/offline_inference/ming_flash_omni_tts/`.
-- The server uses `use_zero_spk_emb=True` and the default decode args
-  (`max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`).
-  For other caption fields (`语速`, `基频`, `IP`, BGM, etc.) or overriding
-  decode args, use the offline e2e example where `additional_information`
-  is set explicitly.
diff --git a/examples/online_serving/moss_tts_nano/README.md b/examples/online_serving/moss_tts_nano/README.md
deleted file mode 100644
index b6c47322520..00000000000
--- a/examples/online_serving/moss_tts_nano/README.md
+++ /dev/null
@@ -1,147 +0,0 @@
-# MOSS-TTS-Nano
-
-## Model checkpoint
-
-| Model | Description |
-|-------|-------------|
-| `OpenMOSS-Team/MOSS-TTS-Nano` | 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec, 48 kHz mono (mixed down from upstream stereo), ZH/EN/JA |
-
-> **No built-in speaker presets.** Every request must include `ref_audio`.
-> The server uses upstream's recommended `voice_clone` mode (per
-> upstream's README and `infer.py` example). The OpenAI-schema `voice`
-> and `ref_text` fields are accepted but ignored — `voice_clone` does
-> not consume a transcript, and upstream's `continuation` mode (the only
-> path that accepts `prompt_text`) emits near-silent output with a
-> reference clip + transcript pair, so it is not exposed here.
->
-> Sample reference clips are available in the upstream repo under
-> [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio)
-> (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`).
-
-## Gradio Demo
-
-An interactive Gradio demo is available with custom voice cloning and
-streaming support. Upload your own reference audio in the UI.
-
-```bash
-# Option 1: Launch server + Gradio together
-./run_gradio_demo.sh
-
-# Option 2: If server is already running
-python gradio_demo.py --api-base http://localhost:8091
-```
-
-Then open http://localhost:7860 in your browser.
-
-## Launch the Server
-
-```bash
-vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091
-```
-
-The deploy config at `vllm_omni/deploy/moss_tts_nano.yaml` auto-loads; no
-`--stage-configs-path`, `--trust-remote-code`, or `--enforce-eager` flags
-are needed.
-
-Or use the convenience script:
-
-```bash
-./run_server.sh
-```
-
-## Send TTS Request
-
-Every request needs `ref_audio` (base64 data URL). Reuse a saved sample:
-
-```bash
-# Fetch a sample reference clip from the upstream repo (one-off).
-# Cache under XDG_CACHE_HOME so it survives across runs and stays user-scoped.
-REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
-mkdir -p "$REF_DIR"
-REF_WAV="$REF_DIR/zh_1.wav"
-[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
-REF_AUDIO=$(base64 -w 0 "$REF_WAV")
-
-curl -X POST http://localhost:8091/v1/audio/speech \
-    -H "Content-Type: application/json" \
-    -d "{
-        \"input\": \"你好，这是语音合成测试。\",
-        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
-        \"response_format\": \"wav\"
-    }" --output output.wav
-```
-
-### Using Python
-
-```python
-import base64
-import os
-import urllib.request
-from pathlib import Path
-
-import httpx
-
-ref_dir = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "moss-tts-nano"
-ref_dir.mkdir(parents=True, exist_ok=True)
-ref_wav = ref_dir / "zh_1.wav"
-if not ref_wav.exists() or ref_wav.stat().st_size == 0:
-    urllib.request.urlretrieve(
-        "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav",
-        ref_wav,
-    )
-
-with ref_wav.open("rb") as f:
-    ref_audio_b64 = base64.b64encode(f.read()).decode("ascii")
-
-response = httpx.post(
-    "http://localhost:8091/v1/audio/speech",
-    json={
-        "input": "你好，这是语音合成测试。",
-        "ref_audio": f"data:audio/wav;base64,{ref_audio_b64}",
-        "response_format": "wav",
-    },
-    timeout=300.0,
-)
-
-with open("output.wav", "wb") as f:
-    f.write(response.content)
-```
-
-### Streaming
-
-```bash
-curl -X POST http://localhost:8091/v1/audio/speech \
-    -H "Content-Type: application/json" \
-    -d "{
-        \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
-        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
-        \"stream\": true,
-        \"response_format\": \"pcm\"
-    }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -
-```
-
-**Note:** Output is 48 kHz mono PCM. Upstream's audio tokenizer is internally stereo at 48 kHz; the model wrapper averages the two channels into mono before reaching the engine, so playback duration / pitch are correct against the WAV header's 48 kHz rate.
-
-## API Parameters
-
-MOSS-TTS-Nano uses the standard `/v1/audio/speech` endpoint.
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `input` | string | **required** | Text to synthesize (ZH / EN / JA) |
-| `ref_audio` | string | **required** | Base64 data URL of the reference audio clip |
-| `ref_text` | string | accepted, ignored | Schema-compatible field; voice_clone mode does not consume a transcript |
-| `response_format` | string | `"wav"` | Audio format: wav, mp3, flac, pcm |
-| `stream` | bool | false | Stream raw PCM chunks |
-| `max_new_tokens` | int | 4096 | Maximum tokens to generate |
-
-The `voice` and `ref_text` fields from the OpenAI schema are accepted but
-ignored — there are no built-in speaker presets in MOSS-TTS-Nano, and
-upstream's voice_clone mode does not consume a transcript.
-
-## Troubleshooting
-
-1. **`libnvrtc.so.13: cannot open shared object file`**: torchaudio 2.10+ defaults to torchcodec which requires NVRTC. vLLM-Omni patches this automatically at model load time to use soundfile instead.
-2. **Connection refused**: Ensure the server is running on the correct port.
-3. **Flashinfer version mismatch**: Set `FLASHINFER_DISABLE_VERSION_CHECK=1` if you see version warnings.
-4. **Out of memory**: The default `gpu_memory_utilization=0.3` is conservative. Increase it in the stage config if you have more VRAM available.
diff --git a/examples/online_serving/text_to_speech/README.md b/examples/online_serving/text_to_speech/README.md
index 836b8ed7ac8..8481e723ef9 100644
--- a/examples/online_serving/text_to_speech/README.md
+++ b/examples/online_serving/text_to_speech/README.md
@@ -15,9 +15,11 @@ For the full list of supported architectures across all modalities, see
 | Model | HuggingFace repo | Voice cloning | Streaming | Voice presets / upload | Gradio demo |
 |---|---|---|---|---|---|
 | Fish Speech S2 Pro | `fishaudio/s2-pro` | ✓ (`ref_audio`+`ref_text`) | ✓ (PCM stream) | — | ✓ |
+| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | — (caption-controlled) | — | caption fields (`instructions`) | — |
+| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | ✓ (`ref_audio` required) | ✓ (PCM stream) | — | ✓ |
 | OmniVoice | `k2-fsa/OmniVoice` | (offline only) | — | — | — |
 | Qwen3-TTS | `Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}` | ✓ (Base) | ✓ (PCM + WebSocket) | ✓ (presets + `/v1/audio/voices` upload) | ✓ (standard + FastRTC) |
-| VoxCPM | local model dir | ✓ | ✓ (PCM stream) | — | — |
+| VoxCPM | `openbmb/VoxCPM-0.5B` | ✓ | ✓ (PCM stream) | — | — |
 | VoxCPM2 | `openbmb/VoxCPM2` | ✓ | ✓ (AudioWorklet via gradio) | — | ✓ |
 | Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | ✓ (gated upstream) | ✓ | ✓ (presets) | ✓ |
 
@@ -141,6 +143,105 @@ python fish_speech/gradio_demo.py --api-base http://localhost:8091  # if server
 
 ---
 
+## Ming-flash-omni-TTS
+
+Standalone talker-only deployment of Ming-flash-omni-2.0. Voice is controlled through caption text passed via `instructions`.
+
+### Launch
+```bash
+# from repo root
+bash examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
+```
+Equivalent manual command:
+```bash
+vllm serve Jonathan1909/Ming-flash-omni-2.0 \
+    --deploy-config vllm_omni/deploy/ming_flash_omni_tts.yaml \
+    --host 0.0.0.0 --port 8091 \
+    --trust-remote-code --omni
+```
+
+### Sending requests
+```bash
+python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
+    --text "我们当迎着阳光辛勤耕作，去摘取，去制作，去品尝，去馈赠。" \
+    --output ming_online.wav
+```
+
+ASMR-style caption via `instructions`:
+```bash
+python examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py \
+    --text "我会一直在这里陪着你，直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗？" \
+    --instructions "这是一种ASMR耳语，属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语，声音气音成分重。" \
+    --output ming_online_asmr.wav
+```
+
+### Notes
+- The server uses `use_zero_spk_emb=True` and the cookbook decode defaults (`max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`). To set other caption fields (`语速`, `基频`, `IP`, BGM, etc.) or override the decode args, use the offline example, which sets `additional_information` explicitly.
+- This is the online counterpart of [`examples/offline_inference/text_to_speech/ming_flash_omni_tts/`](../../offline_inference/text_to_speech/ming_flash_omni_tts/).
+- For multimodal Ming-flash-omni online serving, see [`examples/online_serving/ming_flash_omni/`](../../ming_flash_omni/).
+
+---
+
+## MOSS-TTS-Nano
+
+Single-stage pipeline: a 0.1B autoregressive LM plus the MOSS-Audio-Tokenizer-Nano codec, producing 48 kHz mono audio. Every request must include `ref_audio`; there are no built-in speaker presets.
+
+> The OpenAI-schema `voice` and `ref_text` fields are accepted but ignored — `voice_clone` does not consume a transcript, and upstream's `continuation` mode (the only path that accepts `prompt_text`) emits near-silent output, so it is not exposed here. Sample reference clips ship in the upstream repo under [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio).
+
+### Launch
+```bash
+vllm serve OpenMOSS-Team/MOSS-TTS-Nano --omni --port 8091
+# or, from examples/online_serving/text_to_speech/:
+./moss_tts_nano/run_server.sh
+```
+The deploy config at `vllm_omni/deploy/moss_tts_nano.yaml` auto-loads; no `--stage-configs-path`, `--trust-remote-code`, or `--enforce-eager` flags are needed.
+
+### Sending requests
+```bash
+# One-off fetch of a sample reference clip; cache under XDG_CACHE_HOME.
+REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
+mkdir -p "$REF_DIR"
+REF_WAV="$REF_DIR/zh_1.wav"
+[ -s "$REF_WAV" ] || curl -L -o "$REF_WAV" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav
+REF_AUDIO=$(base64 < "$REF_WAV" | tr -d '\n')  # portable: `-w 0` is GNU-only
+
+curl -X POST http://localhost:8091/v1/audio/speech \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"input\": \"你好，这是语音合成测试。\",
+        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
+        \"response_format\": \"wav\"
+    }" --output output.wav
+```
+
+### Streaming PCM
+```bash
+curl -X POST http://localhost:8091/v1/audio/speech \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"input\": \"Hello, streaming output from MOSS-TTS-Nano.\",
+        \"ref_audio\": \"data:audio/wav;base64,${REF_AUDIO}\",
+        \"stream\": true,
+        \"response_format\": \"pcm\"
+    }" --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -
+```
+
+### Gradio demo
+```bash
+# Option 1: launch server + Gradio together
+./moss_tts_nano/run_gradio_demo.sh
+
+# Option 2: server already running
+python moss_tts_nano/gradio_demo.py --api-base http://localhost:8091
+```
+Then open http://localhost:7860 in your browser.
+
+### Notes
+- Output is 48 kHz mono PCM (the upstream tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
+- Requests use the standard `/v1/audio/speech` shape: `input`, `ref_audio` (base64 data URL), `response_format`, `stream`, `max_new_tokens`.
+
+---
+
 ## OmniVoice
 
 Zero-shot multilingual TTS (600+ languages). Online serving currently exposes **auto voice** only; voice cloning and voice design are available offline.
diff --git a/examples/online_serving/ming_flash_omni_tts/run_server.sh b/examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
similarity index 100%
rename from examples/online_serving/ming_flash_omni_tts/run_server.sh
rename to examples/online_serving/text_to_speech/ming_flash_omni_tts/run_server.sh
diff --git a/examples/online_serving/ming_flash_omni_tts/speech_client.py b/examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py
similarity index 100%
rename from examples/online_serving/ming_flash_omni_tts/speech_client.py
rename to examples/online_serving/text_to_speech/ming_flash_omni_tts/speech_client.py
diff --git a/examples/online_serving/moss_tts_nano/gradio_demo.py b/examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py
similarity index 100%
rename from examples/online_serving/moss_tts_nano/gradio_demo.py
rename to examples/online_serving/text_to_speech/moss_tts_nano/gradio_demo.py
diff --git a/examples/online_serving/moss_tts_nano/run_gradio_demo.sh b/examples/online_serving/text_to_speech/moss_tts_nano/run_gradio_demo.sh
similarity index 100%
rename from examples/online_serving/moss_tts_nano/run_gradio_demo.sh
rename to examples/online_serving/text_to_speech/moss_tts_nano/run_gradio_demo.sh
diff --git a/examples/online_serving/moss_tts_nano/run_server.sh b/examples/online_serving/text_to_speech/moss_tts_nano/run_server.sh
similarity index 100%
rename from examples/online_serving/moss_tts_nano/run_server.sh
rename to examples/online_serving/text_to_speech/moss_tts_nano/run_server.sh
diff --git a/recipes/inclusionAI/Ming-flash-omni-2.0.md b/recipes/inclusionAI/Ming-flash-omni-2.0.md
index 124b94c3872..71a90cbe7c8 100644
--- a/recipes/inclusionAI/Ming-flash-omni-2.0.md
+++ b/recipes/inclusionAI/Ming-flash-omni-2.0.md
@@ -22,9 +22,12 @@ Use this recipe when you want a known-good starting point for serving
 
 - Upstream model:
   [`inclusionAI/Ming`](https://github.com/inclusionAI/Ming)
-- For offline inference and additional client variants, see
-  `examples/offline_inference/ming_flash_omni{,_tts}/` and
-  `examples/online_serving/ming_flash_omni{,_tts}/`.
+- For offline inference and additional client variants, see the
+  multimodal example dirs `examples/offline_inference/ming_flash_omni/` and
+  `examples/online_serving/ming_flash_omni/`. The standalone TTS variant
+  lives under the consolidated text-to-speech hub at
+  `examples/offline_inference/text_to_speech/ming_flash_omni_tts/` and
+  `examples/online_serving/text_to_speech/ming_flash_omni_tts/`.
 
 
 ## Hardware Support