[WIP]: Add Ming-omni-tts dense 0.5B pipeline by akshatvishu · Pull Request #2906 · vllm-project/vllm-omni

akshatvishu · 2026-04-18T19:32:28Z

Purpose

Add Ming-omni-tts dense 0.5B support to vLLM-Omni via a two-stage AR+Flow → Audio VAE pipeline.

Original repo : https://github.com/inclusionAI/Ming-omni-tts

Resolves:
#1461

Changes:
Model files (vllm_omni/model_executor/models/ming_tts/)

ming_tts.py — top-level two-stage dispatcher and weight-loading entry point
ming_tts_llm.py — Stage-0 Qwen2 AR backbone with inline Aggregator, FlowLoss, stop head, and latent patch emission
ming_tts_audio_vae.py — Stage-1 Audio VAE decoder producing 44.1 kHz mono waveform output
config_ming_tts.py — Ming dense constants, runtime keys, latent sizes, token IDs, stop-head defaults, and sample-rate validation
configuration_ming_dense.py — Hugging Face config adapter for inclusionAI/Ming-omni-tts-0.5B
prompt_builder.py — prompt construction for speech, music, instructions, TTA, prompt waveform, and speaker embeddings
ingress.py — first-stage prompt ingestion for the disaggregated pipeline
speaker_extractor.py — CampPlus 192-d speaker embedding extraction for reference audio
fm/ — Flow Matching modules used by Stage-0 latent generation
audio_tokenizer/ — Ming Audio VAE tokenizer and decoder support modules

Registry

Register MingTTSForConditionalGeneration, MingLLMModel, and MingAudioVAEModel in vllm_omni/model_executor/models/registry.py

Stage config & input processors

vllm_omni/model_executor/stage_configs/ming_tts.yaml — sequential two-stage AR+Flow → Audio VAE pipeline
vllm_omni/model_executor/stage_configs/ming_tts_async_chunk.yaml — async chunk pipeline with SharedMemoryConnector, latent_chunk_size: 25, and max_num_seqs: 1
vllm_omni/model_executor/stage_input_processors/ming_tts.py — Stage-0 → Stage-1 latent patch transfer for llm2audio_vae and llm2audio_vae_async_chunk, including final partial chunk flush

Offline examples

examples/offline_inference/ming_tts/end2end.py — end-to-end Omni example covering 11 cookbook cases: style, ip, bgm, tta, emotion, basic, dialect, zero_shot, podcast, speech_bgm, speech_sound
examples/offline_inference/ming_tts/README.md — offline launch notes for sequential and async chunk runs

Online serving

vllm_omni/entrypoints/openai/serving_speech.py — Ming prompt builder for OpenAI-compatible /v1/audio/speech, with structured instructions, voice → IP, language → dialect, reference audio, 192-d speaker embeddings, podcast multi-speaker conditioning, and streaming PCM/WAV output
examples/online_serving/ming_tts/run_server.sh — async chunk server launch script
examples/online_serving/ming_tts/openai_speech_client.py — API client covering Ming controls and streaming output
examples/online_serving/ming_tts/run_curl.sh — curl examples for /v1/audio/speech
examples/online_serving/ming_tts/README.md and docs/user_guide/examples/online_serving/ming_tts.md — online serving documentation

Architecture:

Stage 0: Qwen2ForCausalLM + Aggregator + FlowLoss → latent audio patches
Stage 1: Ming Audio VAE → 44.1 kHz mono waveform

Known limitations / follow-ups:

Online /v1/audio/speech does not yet expose prompt_mode=music/tta or FlowLoss controls (cfg, sigma, temperature); online BGM and TTA require a future prompt-mode API extension.
Stage configs use max_num_seqs: 1; multi-request batching is not yet validated.
latent_chunk_size: 5 improves online TTFP significantly but diverges on podcast in the offline async matrix; repo YAML stays on the validated latent_chunk_size: 25 default until that is resolved.

Test Plan

Validation was performed on an NVIDIA L4 GPU (Colab).

Offline sequential — full 11-case cookbook matrix :

python examples/offline_inference/ming_tts/end2end.py --case <case>

Offline async_chunk — full 11-case cookbook matrix:

python examples/offline_inference/ming_tts/end2end.py \
    --case <case> \
    --streaming \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_tts_async_chunk.yaml \
    --enforce-eager

Online serving — /v1/audio/speech async_chunk checks:

# Start server
vllm-omni serve inclusionAI/Ming-omni-tts-0.5B \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_tts_async_chunk.yaml \
    --omni --enforce-eager

# Run client checks (style, ip, basic, emotion, dialect, zero_shot, podcast,
# speech_bgm, speech_sound, streaming PCM, ref_audio, speaker_embedding, podcast multi-ref)

Test Result

Offline correctness — sequential vs. async_chunk (latent_chunk_size: 25):

All 11 cases produced identical frame counts and Stage-1 total patch counts between sequential and default async_chunk, confirming correct Stage-0 → Stage-1 handoff and final partial chunk flush.

Case	Frames / Patches / Audio (s)	Seq = Async25
`style`	409248 / 29 / 9.28	✅
`ip`	183456 / 13 / 4.16	✅
`bgm`	1326528 / 94 / 30.08	✅
`tta`	465696 / 33 / 10.56	✅
`emotion`	324576 / 23 / 7.36	✅
`basic`	211680 / 15 / 4.80	✅
`dialect`	239904 / 17 / 5.44	✅
`zero_shot`	409248 / 29 / 9.28	✅
`podcast`	437472 / 31 / 9.92	✅
`speech_bgm`	296352 / 21 / 6.72	✅
`speech_sound`	352800 / 25 / 8.00	✅

Upstream FlashAttention comparison (cold, single-request, L4):

Upstream: torch 2.6.0+cu124, FlashAttention 2.7.4.post1. vLLM-Omni VAE stage runs through SDPA, not upstream FlashAttention. Integration comparison, not kernel parity benchmark.

Case	Upstream RTF	vLLM Seq RTF	vLLM Async25 RTF
`style`	0.704	1.026	0.709
`ip`	0.695	0.978	1.045
`bgm`	0.692	0.611	0.571
`emotion`	0.688	0.823	0.830
`basic`	0.689	0.918	0.917
`dialect`	0.684	0.869	0.915
`zero_shot`	0.692	0.754	0.684
`podcast`	0.697	0.735	0.676
`speech_bgm`	0.687	0.823	0.820
`speech_sound`	0.681	0.772	0.808
(avg, 10 cases)	0.691	0.831	0.798

vLLM-Omni matches or beats upstream RTF on bgm; async25 is near-parity on style, zero_shot, and podcast. Cold single-request numbers include engine startup and first-request lazy setup costs.

Warm-cache RTF vs upstream (L4, post-warmup, 1 warmup + 1 measured request):

Warm-cache removes first-request lazy setup. Fairer per-request comparison against upstream.

Case	Upstream RTF	vLLM Seq RTF (warm)	vLLM Async25 RTF (warm)	Seq delta vs upstream
`style`	0.704	0.563	0.549	-20.0%
`ip`	0.695	0.565	0.569	-18.7%
`bgm`	0.692	0.571	0.520	-17.5%
`emotion`	0.688	0.548	0.571	-20.3%
`basic`	0.689	0.572	0.602	-17.0%
`dialect`	0.684	0.569	0.557	-16.8%
`zero_shot`	0.692	0.565	0.491	-18.4%
`podcast`	0.697	0.559	0.521	-19.8%
`speech_bgm`	0.687	0.570	0.574	-17.0%
`speech_sound`	0.681	0.555	0.568	-18.5%
(avg, 10 cases)	0.691	0.564	0.552	-18.4%

Warm vLLM-Omni sequential beats upstream FlashAttention RTF across all 10 measured cases. Async25 further reduces RTF for longer/reference-conditioned cases and the zero-ref style / bgm runs.

Warm-cache offline benchmark (L4, 1 warmup + 1 measured request):

Case	Seq wall / RTF	Async25 wall / RTF / TTFP	Delta
`style`	4.864s / 0.563	4.743s / 0.549 / 4.675s	-2.5%
`ip`	2.169s / 0.565	2.185s / 0.569 / 2.181s	+0.7%
`bgm`	17.174s / 0.571	15.632s / 0.520 / 4.819s	-9.0%
`zero_shot`	5.425s / 0.565	4.711s / 0.491 / 4.515s	-13.2%
`podcast`	5.547s / 0.559	5.164s / 0.521 / 4.740s	-6.9%
`tta`	5.830s / 0.552	5.571s / 0.528 / 4.752s	-4.4%
`basic`	3.112s / 0.572	3.277s / 0.602 / 3.273s	+5.3%

Async chunk benefits longer/reference-conditioned cases; overhead roughly cancels the overlap benefit for short speech cases.

Online serving benchmark (10 prompts, concurrency 1, eager, L4):

Config	Mean TTFP (ms)	Mean E2E (ms)	Mean RTF
`sequential_eager`	3354.83	3357.01	0.561
`async_chunk_eager` (chunk=25)	3450.28	3452.35	0.577
`async_chunk_bench` (chunk=5)	911.20	2985.04	0.499

latent_chunk_size: 5 reduces mean TTFP by ~73% and E2E by ~11% vs. sequential, but remains experimental pending podcast offline finalization.

Online /v1/audio/speech validation (async_chunk, all speech-mode cases):

All cases returned valid WAV at 44.1 kHz. Streaming PCM returned progressive chunks. Reference audio, speaker embedding, and podcast multi-reference checks passed.

Case	Output	Size (bytes)	Sample rate	Frames
`style`	WAV	790316	44100	395136
`ip`	WAV	366956	44100	183456
`basic`	WAV	536300	44100	268128
`emotion`	WAV	649196	44100	324576
`dialect`	WAV	395180	44100	197568
`zero_shot`	WAV	931436	44100	465696
`podcast`	WAV	846764	44100	423360
`speech_bgm`	WAV	677420	44100	338688
`speech_sound`	WAV	649196	44100	324576
`streaming`	PCM	338688	N/A	N/A

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

chatgpt-codex-connector · 2026-04-18T19:32:34Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

akshatvishu · 2026-04-18T19:45:44Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. Credits must be used to enable repository wide code reviews.

I guess everyone is suffering under the new limits (╥_╥)

hsliuustc0106

This PR is marked as [WIP] and is substantial (~10,500 lines / 47 files).

Could you please run the L3 tests locally and paste the results here? This helps validate the integration on your end before we proceed with full review.

hsliuustc0106

please make changes accordingly after #2383 merged. For the model usage, I suggest to write a model recipe under vllm_omni/recipes using the template. It seems there are some duplicate/dead codes as well, can you try to compress it first?

hsliuustc0106 · 2026-04-19T00:52:03Z

I also recommend you to use the add-tts-models skill

linyueqian

Thanks for the thorough test matrix and the warm-cache RTF numbers, those are the right kind of evidence for a model-add PR. At 10.5k additions the PR is hard to review carefully. I think it can stay as one PR if we condense it by reusing modules that already live in the repo. Inline comments below on the specific files, ordered roughly by expected line savings.

Not blocking merge, flagging for the author and maintainers.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,868 @@
+# SPDX-License-Identifier: Apache-2.0


[MAJOR] This file is 868 lines and mixes a Qwen2 AR backbone, the Aggregator, FlowLoss head, stop head, and latent patch emission. Two asks:

The Qwen2 backbone should import from upstream vLLM (from vllm.model_executor.models.qwen2 import Qwen2Model) rather than being reimplemented here. qwen3_omni Thinker and qwen3_tts Talker both follow that pattern. Estimated saving: 300 to 400 lines.

After that, please split the remainder into backbone.py, aggregator.py, flowloss_head.py, patch_emission.py. Our coding-style guideline targets 200 to 400 lines per file with 800 as a hard cap.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,207 @@
+# SPDX-License-Identifier: Apache-2.0


[MAJOR] cosyvoice3/code2wav_core/cfm.py (325 lines) already implements Conditional Flow Matching. This PR adds fm/cfm.py (207) plus fm/modules.py (147), for roughly 350 lines of duplicated logic.

Suggestion: promote the cosyvoice3 CFM plus a DiT base to vllm_omni/model_executor/modules/flow_matching/, have Ming import it, and keep only fm/dit.py (Ming-specific conditioning) and fm/flowloss.py here.

This is a cross-model refactor, fine to land as a prerequisite PR owned by a maintainer or cc @yuanheng-zhao rather than blocking Ming on it. Worth an issue link from the PR body at minimum.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,291 @@
+# SPDX-License-Identifier: Apache-2.0


[MAJOR] The chunk-accumulate-and-final-flush logic here is the same pattern used by stage_input_processors/qwen3_tts.py for talker2code2wav_async_chunk. Extract a helper to stage_input_processors/_chunk.py (or _chunk_transfer.py) and have both models call it. Likely 80 to 120 lines saved across the two files plus easier future maintenance when the SharedMemoryConnector contract changes.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,188 @@
+# SPDX-License-Identifier: Apache-2.0


[MINOR] Pure math. qwen3_tts/tokenizer_25hz/ and voxtral_tts/voxtral_tts_audio_tokenizer.py also ship an iSTFT. Recommend opening a follow-up issue to migrate all three to a shared vllm_omni/model_executor/modules/audio/stft.py. Not a blocker on this PR, but please file the issue so this does not go cold.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: Apache-2.0


[MINOR] Small file, quick check: does cosyvoice3 already load the CampPlus 192-d speaker embedder somewhere under cosyvoice3/utils.py or cosyvoice3/tokenizer.py? If yes, share the loader.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,581 @@
+# SPDX-License-Identifier: Apache-2.0


[MINOR] 581 lines reads as two concerns: the top-level two-stage dispatcher and the weight loader. Split the loader into loader.py and keep the dispatcher plus registry wiring in __init__.py + ming_tts.py at about 200 lines each.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,364 @@
+# SPDX-License-Identifier: Apache-2.0


[MINOR] 364 lines of constants, runtime keys, token IDs, stop-head defaults, and sample-rate validation. Split into constants.py and validation.py.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,86 @@
+async_chunk: true


[MAJOR] #2383 (config refactor 2/N) replaces stage_configs/*.yaml with a two-layer PipelineConfig (Python) plus deploy/<model>.yaml split. Follow-up 2c will remove --stage-configs-path and the legacy ModelPipeline path entirely.

If this PR lands ahead of 2c, please open a migration task on your side so Ming ships a pipeline.py plus deploy/ming_tts.yaml immediately after. Preferable: rebase onto #2383 once it merges and ship directly on the new schema to avoid a rewrite. cc @lishunyang12 for coordination.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,654 @@
+# SPDX-License-Identifier: Apache-2.0


[MINOR] 654 lines is almost entirely an 11-case dispatch ladder. Please move the case definitions to cases.yaml (prompt, ref_audio, speaker, expected sample rate, etc.) and keep a roughly 100-line driver that parametrizes off that file. CI can then iterate the same cases. Expected saving: about 500 lines here alone.

linyueqian · 2026-04-19T01:15:35Z

@@ -0,0 +1,217 @@
+#!/bin/bash


[NIT] 217 lines of curl examples read as documentation rather than a runnable script. Prefer fenced code blocks inside README.md, keep run_curl.sh as a short "here are three sanity checks" helper.

yuanheng-zhao · 2026-04-19T01:59:41Z

@akshatvishu It seems there're a lot added files that could re-use modules from the talker of Ming-flash-omni-2.0 in #2890 , especially modelings such as talker llm, talker vae, fm, spkemb extractor.

I'll update #2890 later today and try to merge it ASAP and then you might want to rebase

cc @linyueqian @hsliuustc0106

akshatvishu · 2026-04-20T14:41:08Z

@yuanheng-zhao Sure, I will wait for #2890 to get merge and will then start working on the suggestion left by @linyueqian as it seems like I can borrow a lot from Ming-flash-omni-2.0; after that I will run and upload the results of L3 test as requested by @hsliuustc0106

yuanheng-zhao · 2026-04-23T04:35:49Z

Hey @akshatvishu , the Ming-flash-omni-2.0 talker (modelings of that model for TTS & Omni-Speech) has been merged to main, let's rebase onto main with cutting off from the talker stage changes. For example,

git rebase --onto main the-talker-branch your-current-branch

Note to fetch and have latest main and my branch on your local

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

…s signature Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

…tecture Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

…to-detection fails Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

akshatvishu requested a review from hsliuustc0106 as a code owner April 18, 2026 19:32

hsliuustc0106 reviewed Apr 18, 2026

View reviewed changes

hsliuustc0106 reviewed Apr 19, 2026

View reviewed changes

linyueqian reviewed Apr 19, 2026

View reviewed changes

akshatvishu added 3 commits April 23, 2026 19:30

feat(ming-tts): add dense omni pipeline

487345c

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

fix(ming-tts): serialize stage0 stop reason as tensor

9cda910

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

docs: Update Ming TTS example

9add4ef

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

akshatvishu force-pushed the feat/ming-omni-tts-dense branch from d949ec7 to 9add4ef Compare April 23, 2026 14:01

akshatvishu added 8 commits April 23, 2026 23:58

Refactor Ming TTS model layout

1bee58d

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

Extract shared async chunk transfer helpers

276b954

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

Migrate Ming TTS to deploy config

8bd43d1

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

Reuse shared speaker embedding loader

ac4fe0a

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

fix: resolve F821 undefined name by adding raw_request to audio chunk…

d1920a5

…s signature Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

fix(ming_tts): update hf_architectures to match BailingMMNative archi…

e8b97bd

…tecture Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

fix(config): ensure DeployConfig.pipeline override is honored when au…

6b8f2c3

…to-detection fails Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

fix ming_tts offline runner truncating multi-chunk audio

d0a51e8

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

zhangj1an mentioned this pull request Apr 25, 2026

[WIP][Model] Add GLM-TTS text-to-speech model support #3141

Draft

5 tasks

Conversation

akshatvishu commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

chatgpt-codex-connector Bot commented Apr 18, 2026

Uh oh!

akshatvishu commented Apr 18, 2026

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 commented Apr 19, 2026

Uh oh!

linyueqian left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuanheng-zhao commented Apr 19, 2026

Uh oh!

akshatvishu commented Apr 20, 2026

Uh oh!

yuanheng-zhao commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

akshatvishu commented Apr 18, 2026 •

edited

Loading