Open
52 commits
487345c
feat(ming-tts): add dense omni pipeline
akshatvishu Apr 18, 2026
9cda910
fix(ming-tts): serialize stage0 stop reason as tensor
akshatvishu Apr 18, 2026
9add4ef
docs: Update Ming TTS example
akshatvishu Apr 18, 2026
1bee58d
Refactor Ming TTS model layout
akshatvishu Apr 23, 2026
276b954
Extract shared async chunk transfer helpers
akshatvishu Apr 23, 2026
8bd43d1
Migrate Ming TTS to deploy config
akshatvishu Apr 23, 2026
ac4fe0a
Reuse shared speaker embedding loader
akshatvishu Apr 23, 2026
d1920a5
fix: resolve F821 undefined name by adding raw_request to audio chunk…
akshatvishu Apr 23, 2026
e8b97bd
fix(ming_tts): update hf_architectures to match BailingMMNative archi…
akshatvishu Apr 23, 2026
6b8f2c3
fix(config): ensure DeployConfig.pipeline override is honored when au…
akshatvishu Apr 23, 2026
d0a51e8
fix ming_tts offline runner truncating multi-chunk audio
akshatvishu Apr 23, 2026
01055e3
docs: migrate Ming TTS docs to deploy config
akshatvishu Apr 28, 2026
12ac6a7
Merge remote-tracking branch 'origin/main' into feat/ming-omni-tts-dense
akshatvishu Apr 28, 2026
169800b
fix incorrectly importing OmniServerParams
akshatvishu Apr 28, 2026
b4187e3
tests: align Ming TTS offline coverage
akshatvishu Apr 28, 2026
5aeb88c
tests: fix Ming TTS online imports
akshatvishu Apr 28, 2026
dc2ea22
test(ming_tts): fix L3 runtime and deploy config path
akshatvishu Apr 28, 2026
a781827
fix(ming_tts): align async chunk payload with generation adapter
akshatvishu Apr 28, 2026
0508d2b
Fix Ming TTS codec frame rate derivation for online serving
akshatvishu Apr 28, 2026
80de956
Merge remote-tracking branch 'upstream/main' into feat/ming-omni-tts-…
akshatvishu May 13, 2026
88199f4
refactor(ming-tts): flatten prompt helpers and remove legacy dense
akshatvishu May 13, 2026
42bacb4
test(ming-tts): update e2e coverage for prompt helper refactor
akshatvishu May 13, 2026
ee047ce
style: reorder imports in ming_tts/qwen3 and fix noqa in test_serving…
akshatvishu May 13, 2026
55f9025
Add Ming dense prompt utilities
akshatvishu May 13, 2026
f5dd5bb
Fix Ming dense config initialization order
akshatvishu May 13, 2026
9609ece
Fix Ming dense config initialization and e2e validation
akshatvishu May 13, 2026
bf95f76
Disable prefix caching for Ming dense TTS
akshatvishu May 13, 2026
5170171
vllm_omni/entrypoints/openai/serving_speech.py
akshatvishu May 13, 2026
0b7eeca
fix(ming_tts): align llm2audio_vae signature with custom_process_inpu…
akshatvishu May 13, 2026
dc3240a
fix(ming_tts): fall back to soundfile when torchcodec unavailable
akshatvishu May 13, 2026
b72dea9
Fix Ming speaker audio fallback
akshatvishu May 13, 2026
6edc6ff
Align Ming podcast prompt formatting
akshatvishu May 13, 2026
736f7b8
ming-tts: address decode state and ISTFT reuse concerns
akshatvishu May 14, 2026
f1a8179
fix(ming_tts_llm): allow compute_logits with plain tensor during prof…
akshatvishu May 14, 2026
e0dfa42
ming-tts: align sampled multimodal stop state for logits
akshatvishu May 14, 2026
94c91a9
fix(ming-tts): skip text-mode requests in decode window validation
akshatvishu May 14, 2026
9ac9689
refactor(ming-tts): prune dense runtime dead code
akshatvishu May 26, 2026
f3f730c
Merge branch 'main' into feat/ming-omni-tts-dense
akshatvishu May 26, 2026
51d03d4
style: apply pre-commit formatting fixes
akshatvishu May 26, 2026
a10ed8f
test(ming-tts): update zh evaluation prompt in e2e tests
akshatvishu May 26, 2026
dfebed5
refactor(ming-tts): centralize stop reason metadata
akshatvishu May 26, 2026
e9519a0
test(ming-tts): keep branch tests focused on e2e
akshatvishu May 26, 2026
93bde5f
refactor(ming-tts): remove redundant validation from FlowLoss.sample
akshatvishu May 26, 2026
c924b34
refactor(ming-tts): remove dead conditioning dropout arg
akshatvishu May 26, 2026
0b2d070
refactor(ming-tts): use runner request id
akshatvishu May 27, 2026
4d923c7
examples: consolidate Ming TTS examples
akshatvishu May 27, 2026
da94f8a
Add Ming-omni-tts 0.5b Dense recipe
akshatvishu May 27, 2026
2b9d5b5
refactor: remove dead ingress and preprocessor plumbing per review
akshatvishu May 28, 2026
a8a7bf7
fix(ming-tts): prevent abandoned stream leaks and fix encoder race co…
akshatvishu May 28, 2026
6155fae
chore(ming-tts): address config, pathing and defensive review nits
akshatvishu May 28, 2026
7fc12ef
Gate Ming TTS final-stage logging
akshatvishu May 28, 2026
9862a46
Split Ming TTS prompt helpers by responsibility
akshatvishu May 28, 2026
1 change: 1 addition & 0 deletions docs/models/supported_models.md
@@ -58,6 +58,7 @@ th {
| `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `MingTTSForConditionalGeneration` | Ming-omni-tts dense 0.5B | `inclusionAI/Ming-omni-tts-0.5B` | ✅︎ | | | |
| `GLMTTSForConditionalGeneration` | GLM-TTS | `zai-org/GLM-TTS` | ✅︎ | | | |
| `NextStep11Pipeline` | NextStep-1.1 | `stepfun-ai/NextStep-1.1` | ✅︎ | ✅︎ | | ✅︎ |
| `MiMoAudioModel` | MiMo-Audio-7B-Instruct | `XiaomiMiMo/MiMo-Audio-7B-Instruct` | ✅︎ | ✅︎ | | |
140 changes: 140 additions & 0 deletions docs/user_guide/examples/offline_inference/ming_tts.md
@@ -0,0 +1,140 @@
# Ming-omni-tts

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_speech/ming_tts>.

This directory contains an offline Ming example that calls the in-repo Ming prompt builder directly. It covers the upstream dense 0.5B feature set: style, IP voice, music-only generation, text-to-audio (TTA), emotion, dialect, zero-shot cloning, podcast, speech+BGM, and speech+sound.

## Quick Start

Run a zero-speaker style case:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case style \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

Run emotion-controlled speech:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case emotion \
--ref-audio /path/to/emotion_prompt.wav \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

Run zero-shot cloning with a transcript:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case zero_shot \
--ref-audio /path/to/reference.wav \
--ref-text "在此奉劝大家别乱打美白针。" \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

Run podcast generation:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case podcast \
--ref-audio-paths /path/to/CTS-CN-F2F-2019-11-11-423-012-A.wav /path/to/CTS-CN-F2F-2019-11-11-423-012-B.wav \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

Run text-to-audio event generation:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case tta \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

Run with stats and a manifest:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case style \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager \
--enable-stats \
--stats-log-file output_audio/ming_style_pipeline.log \
--metadata-json output_audio/ming_style_manifest.json
```

## Built-in Cases

- `style`: zero-speaker style-conditioned speech
- `ip`: zero-speaker IP voice generation
- `bgm`: music generation
- `tta`: text-to-audio event generation with FlowLoss controls
- `emotion`: reference-audio speech with emotion control
- `basic`: reference-audio cloning with speed / pitch / volume control
- `dialect`: reference-audio cloning with dialect control
- `zero_shot`: reference-audio cloning with explicit transcript
- `podcast`: multi-reference dialogue generation with automatic speaker embedding extraction
- `speech_bgm`: speech with background music conditioning
- `speech_sound`: speech with environment sound conditioning

## Streaming

Use `async_chunk` streaming with `AsyncOmni`:

```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case basic \
--ref-audio /path/to/10002287-00000095.wav \
--streaming \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

`--streaming` currently supports one prompt per process invocation. Use
blocking mode for `--num-prompts > 1`.

## Validation matrix

The example is intended to cover the dense TTS workflows used by the Ming
validation helper:

| Case | Blocking | Async chunk | Extra inputs |
|---|---:|---:|---|
| `style` | Yes | Optional smoke test | none |
| `ip` | Yes | Optional smoke test | none |
| `bgm` | Yes | Optional smoke test | none |
| `tta` | Yes | Optional smoke test | none |
| `emotion` | Yes | Yes | reference WAV |
| `basic` | Yes | Yes | reference WAV |
| `dialect` | Yes | Yes | reference WAV |
| `zero_shot` | Yes | Yes | reference WAV and transcript |
| `podcast` | Yes | Yes | two reference WAVs |
| `speech_bgm` | Yes | Yes | reference WAV |
| `speech_sound` | Yes | Yes | reference WAV |
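
The matrix above can be turned into a small pre-flight check. The sketch below is hypothetical and is not part of `end2end.py`: the function name `check_case_inputs` and its parameters are invented for illustration, and the per-case requirements are simply copied from the table and from the one-prompt streaming limit noted in the Streaming section.

```python
# Hypothetical pre-flight check; the requirements mirror the validation matrix.
REF_AUDIO_COUNT = {
    "style": 0, "ip": 0, "bgm": 0, "tta": 0,
    "emotion": 1, "basic": 1, "dialect": 1, "zero_shot": 1,
    "podcast": 2, "speech_bgm": 1, "speech_sound": 1,
}
NEEDS_TRANSCRIPT = {"zero_shot"}


def check_case_inputs(case, ref_audios=(), ref_text=None,
                      streaming=False, num_prompts=1):
    """Return a list of problems; an empty list means the inputs fit the matrix."""
    if case not in REF_AUDIO_COUNT:
        return [f"unknown case: {case}"]
    problems = []
    needed = REF_AUDIO_COUNT[case]
    if len(ref_audios) < needed:
        problems.append(f"{case} needs {needed} reference WAV(s), got {len(ref_audios)}")
    if case in NEEDS_TRANSCRIPT and not ref_text:
        problems.append(f"{case} needs a transcript")
    if streaming and num_prompts > 1:
        problems.append("streaming supports one prompt per invocation")
    return problems


if __name__ == "__main__":
    for message in check_case_inputs("podcast", ref_audios=["speaker_a.wav"]):
        print(message)
```

A check like this only catches missing inputs early; the example script remains the authority on which arguments each case accepts.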

The offline example also exposes vLLM-Omni runtime/reporting controls such as:

- `--num-prompts`
- `--enable-stats`
- `--stats-log-file`
- `--metadata-json`
- `--stage-init-timeout`
- `--init-timeout`
- `--batch-timeout`
- `--worker-backend`
- `--ray-address`

## Example materials

??? abstract "README.md"
``````md
--8<-- "examples/offline_inference/text_to_speech/ming_tts/README.md"
``````
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/text_to_speech/ming_tts/end2end.py"
``````
186 changes: 186 additions & 0 deletions docs/user_guide/examples/online_serving/ming_tts.md
@@ -0,0 +1,186 @@
# Ming-omni-tts

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_speech/ming_tts>.

This example shows how to serve Ming through the OpenAI-compatible `/v1/audio/speech` endpoint. The server builds Ming prompts directly with the in-repo prompt builder, so online requests support Ming-specific structured controls instead of the Qwen placeholder path.

## Installation

Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md) for installation instructions.

## Launch the Server

```bash
vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--omni \
--port 8091 \
--enforce-eager
```

Or:

```bash
cd examples/online_serving/text_to_speech/ming_tts
./run_server.sh
```

The canonical Ming online client is `openai_speech_client.py`. It targets the
local vLLM-Omni server, not OpenAI's cloud API, so `api_key=EMPTY` is enough
for local testing.

## Example Requests

Basic TTS:

```bash
python openai_speech_client.py \
--text "你好,这是 Ming 在线语音合成测试。"
```

Style-conditioned speech:

```bash
python openai_speech_client.py \
--text "我会一直在这里陪着你。" \
--instructions "轻柔的ASMR耳语,慢速,贴近麦克风"
```

Structured Ming control:

```bash
python openai_speech_client.py \
--text "我觉得社会企业同个人都有责任" \
--instruction-json '{"方言":"广粤话"}'
```

IP voice generation:

```bash
python openai_speech_client.py \
--text "这款产品的名字,叫变态坑爹牛肉丸。" \
--voice 灵小甄
```

Reference-audio cloning:

Use `ref_audio` by itself for Ming prompt-waveform conditioning. Add
`ref_text` when the request clones from a transcript, as in zero-shot or
podcast-style prompts.

```bash
python openai_speech_client.py \
--task-type Base \
--text "我们的愿景是构建未来服务业的数字化基础设施。" \
--ref-audio /path/to/reference.wav \
--ref-text "在此奉劝大家别乱打美白针。"
```

Speaker-embedding cloning:

```bash
python openai_speech_client.py \
--task-type Base \
--text "你好,这是一段使用说话人向量的合成语音。" \
--speaker-embedding /path/to/ming_speaker_embedding.json
```

Streaming PCM:

```bash
python openai_speech_client.py \
--text "你好,这是流式输出测试。" \
--instructions "平静,普通话" \
--stream \
--output ming_output.pcm
```
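
A raw `.pcm` file has no header, so most players cannot open it directly. The sketch below wraps the streamed bytes in a WAV container using Python's standard `wave` module. It assumes 16-bit mono little-endian samples at 44.1 kHz (the rate listed for Ming-omni-tts in the text-to-speech README); the sample format is an assumption, so verify it against your server's actual output. The helper name `pcm_to_wav` is hypothetical.

```python
import wave


def pcm_to_wav(pcm_path, wav_path, sample_rate=44100, channels=1, sample_width=2):
    """Wrap headerless PCM bytes in a WAV container and return the frame count.

    The defaults (44.1 kHz, mono, 16-bit) are assumptions about the stream.
    """
    with open(pcm_path, "rb") as src:
        frames = src.read()
    with wave.open(wav_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(sample_width)
        dst.setframerate(sample_rate)
        dst.writeframes(frames)
    return len(frames) // (channels * sample_width)


if __name__ == "__main__":
    import os

    if os.path.exists("ming_output.pcm"):
        count = pcm_to_wav("ming_output.pcm", "ming_output.wav")
        print(f"wrote {count} frames to ming_output.wav")
```

If the resulting audio plays too fast, too slow, or as noise, the assumed sample rate or sample width is probably wrong for your deployment.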

## Curl Helper

Use the bundled helper for common request types:

```bash
./run_curl.sh basic
./run_curl.sh style
./run_curl.sh ip
REF_AUDIO=/path/to/emotion_prompt.wav ./run_curl.sh emotion
REF_AUDIO=/path/to/yue_prompt.wav ./run_curl.sh dialect
REF_AUDIO=/path/to/reference.wav REF_TEXT="在此奉劝大家别乱打美白针。" ./run_curl.sh zero_shot
REF_AUDIO=/path/to/speaker_1.wav REF_AUDIO_2=/path/to/speaker_2.wav REF_TEXT="speaker_1:你好。 speaker_2:你好。" ./run_curl.sh podcast
REF_AUDIO=/path/to/00000309-00000300.wav ./run_curl.sh speech_bgm
REF_AUDIO=/path/to/00000309-00000300.wav ./run_curl.sh speech_sound
REF_AUDIO=/path/to/reference.wav REF_TEXT="在此奉劝大家别乱打美白针。" ./run_curl.sh clone_ref_audio
SPEAKER_EMBEDDING=/path/to/ming_speaker_embedding.json ./run_curl.sh clone_embedding
./run_curl.sh stream
```

## Audio Inputs

- `ref_audio` accepts a local path, remote URL, or `data:` URL
- The Python client converts local files into a base64 `data:` URL
- `speaker_embedding` must be a JSON file with exactly 192 numeric values
- Ming prompt-waveform cases can use `ref_audio` without `ref_text`
- Zero-shot and podcast-style transcript cloning should include `ref_text`
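
The two input rules above (base64 `data:` URLs and 192-value embeddings) can be sketched with the standard library. This is a minimal illustration, not the code in `openai_speech_client.py`: the helper names are hypothetical, the `audio/wav` fallback MIME type is an assumption, and the embedding file is assumed to be a flat JSON array.

```python
import base64
import json
import mimetypes
from pathlib import Path

EMBEDDING_DIM = 192  # size stated in the list above


def to_data_url(path):
    """Encode a local audio file as a base64 data: URL."""
    # Falling back to audio/wav for unknown extensions is an assumption.
    mime = mimetypes.guess_type(path)[0] or "audio/wav"
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def load_speaker_embedding(path):
    """Load a JSON embedding, assumed to be a flat array, and validate it."""
    values = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(values, list) or len(values) != EMBEDDING_DIM:
        raise ValueError(f"expected a list of {EMBEDDING_DIM} values")
    for value in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("embedding values must be numeric")
    return [float(value) for value in values]
```

Validating the embedding on the client side gives a clearer error than a rejected request, but the server remains the authority on what it accepts.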

The bundled `run_curl.sh basic` mode is plain/default TTS and does not require
`REF_AUDIO`. The upstream cookbook-style `basic` case uses `ref_audio` plus
structured speed / pitch / volume instructions.

## Request Types

Ming online serving supports these request families through `/v1/audio/speech`:

| Case | Online support | Required fields |
|------|----------------|-----------------|
| default TTS | Supported | `input`, `max_new_tokens=200` |
| `style` | Supported | `input`, `instructions`, `max_new_tokens=200` |
| `ip` | Supported | `input`, `voice`, `max_new_tokens=200` |
| `basic` helper | Supported | `input`, `max_new_tokens=200` |
| upstream `basic` case | Supported | `input`, `ref_audio`, structured speed / pitch / volume `instructions`, `max_new_tokens=200` |
| `emotion` | Supported | `input`, `ref_audio`, structured emotion `instructions`, `max_new_tokens=200` |
| `dialect` | Supported | `input`, `language` or structured `instructions`, `ref_audio`, `max_new_tokens=200` |
| `zero_shot` | Supported | `input`, `ref_audio`, `ref_text`, `max_new_tokens=200` |
| `podcast` | Supported | `input`, repeated/list `ref_audio`, `ref_text`, `max_new_tokens=200` |
| `speech_bgm` | Supported | `input`, `ref_audio`, structured `instructions` with `{"BGM": ...}`, `max_new_tokens=200` |
| `speech_sound` | Supported | `input`, `ref_audio`, structured `instructions` with `{"BGM": {"ENV": ...}}`, `max_new_tokens=200` |
| `bgm` | Not supported online | Requires a future `prompt_mode=music` API extension |
| `tta` | Not supported online | Requires a future `prompt_mode=tta` API extension |

The online endpoint currently accepts only speech-style requests. Music-only
`bgm` and text-to-audio `tta` remain offline workflows.

## Field Mapping

For Ming, the generic OpenAI request fields map to Ming controls like this:

- `input` -> target text
- `instructions` -> Ming instruction string, or a JSON string for the structured Ming control object
- `voice` -> Ming `IP`
- `language` -> Ming `方言`
- `ref_audio` -> Ming prompt waveform
- `ref_text` -> optional transcript for zero-shot and podcast-style cloning
- `speaker_embedding` -> 192-d Ming speaker embedding
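
The mapping above can be sketched as a function that assembles a JSON body for `/v1/audio/speech`. This is an illustration under stated assumptions rather than the client's actual code: `build_speech_payload` is a hypothetical name, the `model` key follows the usual OpenAI request shape, and sending `speaker_embedding` as an inline list is an assumption. The `max_new_tokens=200` default comes from the request table above.

```python
import json


def build_speech_payload(
    text,
    *,
    model="inclusionAI/Ming-omni-tts-0.5B",
    instructions=None,
    voice=None,
    language=None,
    ref_audio=None,
    ref_text=None,
    speaker_embedding=None,
    max_new_tokens=200,
):
    """Assemble a request body from Ming controls, omitting unset fields."""
    payload = {"model": model, "input": text, "max_new_tokens": max_new_tokens}
    if isinstance(instructions, dict):
        # Structured controls travel as a JSON string in `instructions`.
        instructions = json.dumps(instructions, ensure_ascii=False)
    optional = {
        "instructions": instructions,
        "voice": voice,
        "language": language,
        "ref_audio": ref_audio,
        "ref_text": ref_text,
        "speaker_embedding": speaker_embedding,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


if __name__ == "__main__":
    body = build_speech_payload("我觉得社会企业同个人都有责任",
                                instructions={"方言": "广粤话"})
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Dropping unset keys keeps the request minimal, so the server's own defaults apply to anything the caller did not specify.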

## Voice Listing

- `/v1/audio/voices` lists uploaded voices for Ming.
- Built-in Ming IP labels can still be used as `voice`, but they are not enumerated by the API.

## Example materials

??? abstract "README.md"
``````md
--8<-- "examples/online_serving/text_to_speech/ming_tts/README.md"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/text_to_speech/ming_tts/run_server.sh"
``````
??? abstract "openai_speech_client.py"
``````py
--8<-- "examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py"
``````
??? abstract "run_curl.sh"
``````sh
--8<-- "examples/online_serving/text_to_speech/ming_tts/run_curl.sh"
``````
41 changes: 41 additions & 0 deletions examples/offline_inference/text_to_speech/README.md
@@ -17,6 +17,7 @@ list of supported architectures across all modalities, see
| CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | 2 (talker + code2wav) | ✓ | ✓ | — | 24 kHz |
| Fish Speech S2 Pro | `fishaudio/s2-pro` | dual-AR | ✓ | ✓ | — | 44.1 kHz |
| GLM-TTS | `zai-org/GLM-TTS` | 2 (AR + DiT) | ✓ (required) | ✓ | — | 24 kHz |
| Ming-omni-tts | `inclusionAI/Ming-omni-tts-0.5B` | 2 (AR + audio VAE) | ✓ | ✓ | style / IP / dialect / TTA / podcast | 44.1 kHz |
| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | single (talker only) | — (caption-controlled) | — | style / IP / basic captions | 44.1 kHz |
| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | single (AR + codec) | ✓ (required) | ✓ | voice_clone, continuation | 48 kHz |
| OmniVoice | `k2-fsa/OmniVoice` | 2 (gen + dec) | ✓ | — | voice design, language hint | 24 kHz |
@@ -159,6 +160,46 @@ Streaming requires `async_chunk: true` in the stage config.

---

## Ming-omni-tts

Dense 0.5B two-stage TTS pipeline (`AR + flow` + audio VAE) at 44.1 kHz. The example covers style, IP voice, music-only generation, text-to-audio events, emotion, dialect, zero-shot cloning, podcast, speech+BGM, and speech+environment-sound cases.

### Quick start
```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case style \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

### Voice cloning
```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case zero_shot \
--ref-audio /path/to/reference.wav \
--ref-text "在此奉劝大家别乱打美白针。" \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

### Streaming
```bash
python examples/offline_inference/text_to_speech/ming_tts/end2end.py \
--case basic \
--ref-audio /path/to/reference.wav \
--streaming \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--enforce-eager
```

### Notes
- `style`, `ip`, `bgm`, and `tta` do not require reference audio.
- Reference-audio cases use `--ref-audio`; `zero_shot` also requires `--ref-text`.
- `podcast` uses multiple references via `--ref-audio-paths`.
- Full case details live in [`ming_tts/README.md`](ming_tts/README.md).

---

## Ming-flash-omni-TTS

Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields (`风格` / `IP` / `语速`/`基频`/`音量`) rather than reference audio.