vllm-project · akshatvishu · Apr 18, 2026 · Apr 23, 2026
@@ -58,6 +58,7 @@ th {
 | `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
+| `MingTTSForConditionalGeneration` | Ming-omni-tts dense 0.5B | `inclusionAI/Ming-omni-tts-0.5B` | ✅︎ | | | |
 | `GLMTTSForConditionalGeneration` | GLM-TTS | `zai-org/GLM-TTS` | ✅︎ | | | |
 | `NextStep11Pipeline` | NextStep-1.1 | `stepfun-ai/NextStep-1.1` | ✅︎ | ✅︎ | | ✅︎ |
 | `MiMoAudioModel` | MiMo-Audio-7B-Instruct | `XiaomiMiMo/MiMo-Audio-7B-Instruct` | ✅︎ | ✅︎ | | |

@@ -0,0 +1,140 @@
+# Ming-omni-tts
+
+Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_speech/ming_tts>.
+
+This directory contains an offline Ming example that builds prompts directly with the in-repo Ming prompt builder. It covers the cases exposed by the upstream dense 0.5B model: style, IP, music-only generation, text-to-audio (TTA), emotion, dialect, zero-shot cloning, podcast, speech+BGM, and speech+sound.
+
+## Quick Start
+
+Run a zero-speaker style case:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case style \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+Run emotion-controlled speech:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case emotion \
+    --ref-audio /path/to/emotion_prompt.wav \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+Run zero-shot cloning with a transcript:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case zero_shot \
+    --ref-audio /path/to/reference.wav \
+    --ref-text "在此奉劝大家别乱打美白针。" \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+Run podcast generation:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case podcast \
+    --ref-audio-paths /path/to/CTS-CN-F2F-2019-11-11-423-012-A.wav /path/to/CTS-CN-F2F-2019-11-11-423-012-B.wav \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+Run text-to-audio event generation:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case tta \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+Run with stats and a manifest:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case style \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager \
+    --enable-stats \
+    --stats-log-file output_audio/ming_style_pipeline.log \
+    --metadata-json output_audio/ming_style_manifest.json
+```
+
+## Built-in Cases
+
+- `style`: zero-speaker style-conditioned speech
+- `ip`: zero-speaker IP voice generation
+- `bgm`: music generation
+- `tta`: text-to-audio event generation with FlowLoss controls
+- `emotion`: reference-audio speech with emotion control
+- `basic`: reference-audio cloning with speed / pitch / volume control
+- `dialect`: reference-audio cloning with dialect control
+- `zero_shot`: reference-audio cloning with explicit transcript
+- `podcast`: multi-reference dialogue generation with automatic speaker embedding extraction
+- `speech_bgm`: speech with background music conditioning
+- `speech_sound`: speech with environment sound conditioning
+
+## Streaming
+
+Use `async_chunk` streaming with `AsyncOmni`:
+
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case basic \
+    --ref-audio /path/to/10002287-00000095.wav \
+    --streaming \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+`--streaming` currently supports one prompt per process invocation. Use
+blocking mode when `--num-prompts` is greater than 1.
+
+## Validation matrix
+
+The example is intended to cover the dense TTS workflows used by the Ming
+validation helper:
+
+| Case | Blocking | Async chunk | Extra inputs |
+|---|---:|---:|---|
+| `style` | Yes | Optional smoke test | none |
+| `ip` | Yes | Optional smoke test | none |
+| `bgm` | Yes | Optional smoke test | none |
+| `tta` | Yes | Optional smoke test | none |
+| `emotion` | Yes | Yes | reference WAV |
+| `basic` | Yes | Yes | reference WAV |
+| `dialect` | Yes | Yes | reference WAV |
+| `zero_shot` | Yes | Yes | reference WAV and transcript |
+| `podcast` | Yes | Yes | two reference WAVs |
+| `speech_bgm` | Yes | Yes | reference WAV |
+| `speech_sound` | Yes | Yes | reference WAV |
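The "Extra inputs" column above can be turned into a small pre-flight check. The following is a minimal sketch, not part of `end2end.py`: the names `CaseInputs`, `CASE_INPUTS`, and `missing_inputs` are hypothetical, and the per-case requirements are transcribed from the table rather than read from the example script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseInputs:
    ref_wavs: int           # number of reference WAV files the case expects
    needs_transcript: bool  # whether a reference transcript is expected


# Transcribed from the "Extra inputs" column above, not read from end2end.py.
NO_REF = CaseInputs(0, False)
ONE_REF = CaseInputs(1, False)
CASE_INPUTS = {
    "style": NO_REF, "ip": NO_REF, "bgm": NO_REF, "tta": NO_REF,
    "emotion": ONE_REF, "basic": ONE_REF, "dialect": ONE_REF,
    "speech_bgm": ONE_REF, "speech_sound": ONE_REF,
    "zero_shot": CaseInputs(1, True),
    "podcast": CaseInputs(2, False),
}


def missing_inputs(case, ref_wavs, has_transcript=False):
    """Return human-readable problems; an empty list means the inputs suffice."""
    spec = CASE_INPUTS[case]  # raises KeyError for a case the matrix does not list
    problems = []
    if len(ref_wavs) < spec.ref_wavs:
        problems.append(
            f"{case} needs {spec.ref_wavs} reference WAV(s), got {len(ref_wavs)}"
        )
    if spec.needs_transcript and not has_transcript:
        problems.append(f"{case} needs a reference transcript")
    return problems
```

For example, `missing_inputs("podcast", ["a.wav"])` reports that a second reference WAV is missing, while `missing_inputs("style", [])` returns an empty list.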
+
+The offline example also exposes vLLM-Omni runtime/reporting controls such as:
+
+- `--num-prompts`
+- `--enable-stats`
+- `--stats-log-file`
+- `--metadata-json`
+- `--stage-init-timeout`
+- `--init-timeout`
+- `--batch-timeout`
+- `--worker-backend`
+- `--ray-address`
+
+## Example materials
+
+??? abstract "README.md"
+    ``````md
+    --8<-- "examples/offline_inference/text_to_speech/ming_tts/README.md"
+    ``````
+??? abstract "end2end.py"
+    ``````py
+    --8<-- "examples/offline_inference/text_to_speech/ming_tts/end2end.py"
+    ``````
@@ -0,0 +1,186 @@
+# Ming-omni-tts
+
+Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_speech/ming_tts>.
+
+This example shows how to serve Ming through the OpenAI-compatible `/v1/audio/speech` endpoint. The server builds Ming prompts directly with the in-repo prompt builder, so online requests support Ming-specific structured controls instead of the Qwen placeholder path.
+
+## Installation
+
+Please refer to the main [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md) for installation instructions.
+
+## Launch the Server
+
+```bash
+vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --omni \
+    --port 8091 \
+    --enforce-eager
+```
+
+Or use the bundled launch script:
+
+```bash
+cd examples/online_serving/text_to_speech/ming_tts
+./run_server.sh
+```
+
+The canonical Ming online client is `openai_speech_client.py`. It targets the
+local vLLM-Omni server, not OpenAI's cloud API, so `api_key=EMPTY` is enough
+for local testing.
+
+## Example Requests
+
+Basic TTS:
+
+```bash
+python openai_speech_client.py \
+    --text "你好，这是 Ming 在线语音合成测试。"
+```
+
+Style-conditioned speech:
+
+```bash
+python openai_speech_client.py \
+    --text "我会一直在这里陪着你。" \
+    --instructions "轻柔的ASMR耳语，慢速，贴近麦克风"
+```
+
+Structured Ming control:
+
+```bash
+python openai_speech_client.py \
+    --text "我觉得社会企业同个人都有责任" \
+    --instruction-json '{"方言":"广粤话"}'
+```
+
+IP voice generation:
+
+```bash
+python openai_speech_client.py \
+    --text "这款产品的名字，叫变态坑爹牛肉丸。" \
+    --voice 灵小甄
+```
+
+Reference-audio cloning:
+
+Use `ref_audio` by itself for Ming prompt-waveform conditioning. Add
+`ref_text` when the request also needs the reference transcript, as in
+zero-shot or podcast-style cloning.
+
+```bash
+python openai_speech_client.py \
+    --task-type Base \
+    --text "我们的愿景是构建未来服务业的数字化基础设施。" \
+    --ref-audio /path/to/reference.wav \
+    --ref-text "在此奉劝大家别乱打美白针。"
+```
+
+Speaker-embedding cloning:
+
+```bash
+python openai_speech_client.py \
+    --task-type Base \
+    --text "你好，这是一段使用说话人向量的合成语音。" \
+    --speaker-embedding /path/to/ming_speaker_embedding.json
+```
+
+Streaming PCM:
+
+```bash
+python openai_speech_client.py \
+    --text "你好，这是流式输出测试。" \
+    --instructions "平静，普通话" \
+    --stream \
+    --output ming_output.pcm
+```
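A raw `.pcm` file has no header, so most players cannot open it directly. The sketch below wraps the streamed bytes in a WAV container using Python's standard `wave` module. It is an illustration under stated assumptions: the 44.1 kHz rate comes from the supported-models table for Ming-omni-tts, while 16-bit mono samples are an assumption that should be checked against the server output. `pcm_to_wav` is a hypothetical helper, not part of `openai_speech_client.py`.

```python
import wave


def pcm_to_wav(pcm_path, wav_path, sample_rate=44100, channels=1, sample_width=2):
    """Wrap headerless PCM bytes in a WAV container and return the frame count.

    sample_width=2 (16-bit) and channels=1 (mono) are assumptions, not
    values confirmed by the Ming example.
    """
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return len(pcm) // (channels * sample_width)
```

Calling `pcm_to_wav("ming_output.pcm", "ming_output.wav")` would then produce a file that standard audio tools can play, provided the assumed sample format matches.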
+
+## Curl Helper
+
+Use the bundled helper for common request types:
+
+```bash
+./run_curl.sh basic
+./run_curl.sh style
+./run_curl.sh ip
+REF_AUDIO=/path/to/emotion_prompt.wav ./run_curl.sh emotion
+REF_AUDIO=/path/to/yue_prompt.wav ./run_curl.sh dialect
+REF_AUDIO=/path/to/reference.wav REF_TEXT="在此奉劝大家别乱打美白针。" ./run_curl.sh zero_shot
+REF_AUDIO=/path/to/speaker_1.wav REF_AUDIO_2=/path/to/speaker_2.wav REF_TEXT="speaker_1:你好。 speaker_2:你好。" ./run_curl.sh podcast
+REF_AUDIO=/path/to/00000309-00000300.wav ./run_curl.sh speech_bgm
+REF_AUDIO=/path/to/00000309-00000300.wav ./run_curl.sh speech_sound
+REF_AUDIO=/path/to/reference.wav REF_TEXT="在此奉劝大家别乱打美白针。" ./run_curl.sh clone_ref_audio
+SPEAKER_EMBEDDING=/path/to/ming_speaker_embedding.json ./run_curl.sh clone_embedding
+./run_curl.sh stream
+```
+
+## Audio Inputs
+
+- `ref_audio` accepts a local path, remote URL, or `data:` URL
+- The Python client converts local files into a base64 `data:` URL
+- `speaker_embedding` must be a JSON file with exactly 192 numeric values
+- Ming prompt-waveform cases can use `ref_audio` without `ref_text`
+- Zero-shot and podcast-style transcript cloning should include `ref_text`
+
+The bundled `run_curl.sh basic` mode is plain/default TTS and does not require
+`REF_AUDIO`. The upstream cookbook-style `basic` case uses `ref_audio` plus
+structured speed / pitch / volume instructions.
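The two client-side input rules above (local files become base64 `data:` URLs, and speaker embeddings carry exactly 192 numbers) can be sketched with the standard library alone. This is not the code in `openai_speech_client.py`; `to_data_url` and `load_speaker_embedding` are hypothetical names, and the embedding file is assumed to be a flat JSON array.

```python
import base64
import json
import mimetypes
from pathlib import Path

EMBEDDING_DIM = 192  # size stated in the list above


def to_data_url(path):
    """Encode a local audio file as a base64 data: URL."""
    mime = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def load_speaker_embedding(path):
    """Load an embedding, assuming the JSON file is a flat array of numbers."""
    values = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(values, list) or len(values) != EMBEDDING_DIM:
        raise ValueError(f"expected a list of {EMBEDDING_DIM} values")
    for v in values:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError(f"non-numeric embedding value: {v!r}")
    return [float(v) for v in values]
```

Validating the embedding before sending the request surfaces a malformed file locally instead of as a server-side error.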
+
+## Request Types
+
+Ming online serving supports these request families through `/v1/audio/speech`:
+
+| Case | Online support | Required fields |
+|------|----------------|-----------------|
+| default TTS | Supported | `input`, `max_new_tokens=200` |
+| `style` | Supported | `input`, `instructions`, `max_new_tokens=200` |
+| `ip` | Supported | `input`, `voice`, `max_new_tokens=200` |
+| `basic` helper | Supported | `input`, `max_new_tokens=200` |
+| upstream `basic` case | Supported | `input`, `ref_audio`, structured speed / pitch / volume `instructions`, `max_new_tokens=200` |
+| `emotion` | Supported | `input`, `ref_audio`, structured emotion `instructions`, `max_new_tokens=200` |
+| `dialect` | Supported | `input`, `language` or structured `instructions`, `ref_audio`, `max_new_tokens=200` |
+| `zero_shot` | Supported | `input`, `ref_audio`, `ref_text`, `max_new_tokens=200` |
+| `podcast` | Supported | `input`, repeated/list `ref_audio`, `ref_text`, `max_new_tokens=200` |
+| `speech_bgm` | Supported | `input`, `ref_audio`, structured `instructions` with `{"BGM": ...}`, `max_new_tokens=200` |
+| `speech_sound` | Supported | `input`, `ref_audio`, structured `instructions` with `{"BGM": {"ENV": ...}}`, `max_new_tokens=200` |
+| `bgm` | Not supported online | Requires a future `prompt_mode=music` API extension |
+| `tta` | Not supported online | Requires a future `prompt_mode=tta` API extension |
+
+The online endpoint currently handles speech requests only. Music-only `bgm`
+and text-to-audio `tta` remain offline workflows.
+
+## Field Mapping
+
+For Ming, the generic OpenAI request fields map to Ming controls like this:
+
+- `input` -> target text
+- `instructions` -> Ming instruction string, or a JSON string for the structured Ming control object
+- `voice` -> Ming `IP`
+- `language` -> Ming `方言`
+- `ref_audio` -> Ming prompt waveform
+- `ref_text` -> optional transcript for zero-shot and podcast-style cloning
+- `speaker_embedding` -> 192-d Ming speaker embedding
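The mapping above can be illustrated by building a request body by hand. The sketch below is an assumption-laden example rather than the bundled client: `build_speech_request` and `to_http_request` are hypothetical helpers, the host is assumed to be `localhost` (the port comes from the launch command), and any additional fields the server may expect, such as a model name, are left out. Structured controls are serialized to a JSON string because `instructions` is described as accepting either a plain string or a JSON string.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8091"  # port from the launch command; host is assumed


def build_speech_request(text, *, instructions=None, voice=None,
                         ref_audio=None, ref_text=None, max_new_tokens=200):
    """Assemble a /v1/audio/speech JSON body from the fields listed above."""
    body = {"input": text, "max_new_tokens": max_new_tokens}
    if isinstance(instructions, dict):
        # Structured Ming controls travel as a JSON string.
        body["instructions"] = json.dumps(instructions, ensure_ascii=False)
    elif instructions is not None:
        body["instructions"] = instructions
    for key, value in (("voice", voice), ("ref_audio", ref_audio),
                       ("ref_text", ref_text)):
        if value is not None:
            body[key] = value
    return body


def to_http_request(body):
    """Wrap the body in a POST request object without sending it."""
    return urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A dialect request would then look like `build_speech_request("我觉得社会企业同个人都有责任", instructions={"方言": "广粤话"})`, and passing the result to `urllib.request.urlopen(to_http_request(body))` would send it to a running server.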
+
+## Voice Listing
+
+- `/v1/audio/voices` lists uploaded voices for Ming.
+- Built-in Ming IP labels can still be used as `voice`, but they are not enumerated by the API.
+
+## Example materials
+
+??? abstract "README.md"
+    ``````md
+    --8<-- "examples/online_serving/text_to_speech/ming_tts/README.md"
+    ``````
+??? abstract "run_server.sh"
+    ``````sh
+    --8<-- "examples/online_serving/text_to_speech/ming_tts/run_server.sh"
+    ``````
+??? abstract "openai_speech_client.py"
+    ``````py
+    --8<-- "examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py"
+    ``````
+??? abstract "run_curl.sh"
+    ``````sh
+    --8<-- "examples/online_serving/text_to_speech/ming_tts/run_curl.sh"
+    ``````
@@ -17,6 +17,7 @@ list of supported architectures across all modalities, see
 | CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | 2 (talker + code2wav) | ✓ | ✓ | — | 24 kHz |
 | Fish Speech S2 Pro | `fishaudio/s2-pro` | dual-AR | ✓ | ✓ | — | 44.1 kHz |
 | GLM-TTS | `zai-org/GLM-TTS` | 2 (AR + DiT) | ✓ (required) | ✓ | — | 24 kHz |
+| Ming-omni-tts | `inclusionAI/Ming-omni-tts-0.5B` | 2 (AR + audio VAE) | ✓ | ✓ | style / IP / dialect / TTA / podcast | 44.1 kHz |
 | Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | single (talker only) | — (caption-controlled) | — | style / IP / basic captions | 44.1 kHz |
 | MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | single (AR + codec) | ✓ (required) | ✓ | voice_clone, continuation | 48 kHz |
 | OmniVoice | `k2-fsa/OmniVoice` | 2 (gen + dec) | ✓ | — | voice design, language hint | 24 kHz |
@@ -159,6 +160,46 @@ Streaming requires `async_chunk: true` in the stage config.
 
 ---
 
+## Ming-omni-tts
+
+Dense 0.5B two-stage TTS pipeline (`AR + flow` + audio VAE) at 44.1 kHz. The example covers style, IP voice, music-only generation, text-to-audio events, emotion, dialect, zero-shot cloning, podcast, speech+BGM, and speech+environment-sound cases.
+
+### Quick start
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case style \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+### Voice cloning
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case zero_shot \
+    --ref-audio /path/to/reference.wav \
+    --ref-text "在此奉劝大家别乱打美白针。" \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+### Streaming
+```bash
+python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
+    --case basic \
+    --ref-audio /path/to/reference.wav \
+    --streaming \
+    --deploy-config vllm_omni/deploy/ming_tts.yaml \
+    --enforce-eager
+```
+
+### Notes
+- `style`, `ip`, `bgm`, and `tta` do not require reference audio.
+- Reference-audio cases use `--ref-audio`; `zero_shot` also requires `--ref-text`.
+- `podcast` uses multiple references via `--ref-audio-paths`.
+- Full case details live in [`ming_tts/README.md`](ming_tts/README.md).
+
+---
+
 ## Ming-flash-omni-TTS
 
 Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields (`风格` / `IP` / `语速`/`基频`/`音量`) rather than reference audio.