
[Model] Ming-flash-omni-2.0 Omni-Speech and TTS#2890

Merged
hsliuustc0106 merged 43 commits into vllm-project:main from
yuanheng-zhao:model/ming-omni-talker-draft
Apr 23, 2026

Conversation

@yuanheng-zhao yuanheng-zhao commented Apr 17, 2026


Purpose

Adds Omni-Speech (thinker + talker stages) and TTS support for #1343.

cc @LHXuuu, @ZhengWG

(flow diagram attached to the PR)

Test Plan

TTS (standalone talker)

# offline examples
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case ip
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case basic

# online serving TTS
MODEL="/workspace/models/Ming-flash-omni-2.0"
PORT="8091"
STAGE_CONFIG="vllm_omni/model_executor/stage_configs/ming_flash_omni_tts.yaml"
vllm serve "$MODEL" \
    --stage-configs-path "$STAGE_CONFIG" \
    --port "$PORT" \
    --omni \
    --log-stats

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/workspace/models/Ming-flash-omni-2.0",
        "input": "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?",
        "response_format": "wav"
    }' --output ming_tts_online.wav

curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/models/Ming-flash-omni-2.0",
    "input": "春天来了,万物复苏,大地一片生机盎然。田野里的油菜花开得金灿灿的,蜜蜂在花丛中忙碌地采蜜。远处的山坡上,桃花和杏花竞相绽放,粉的白的交织在一起,美不胜收。清晨的微风带着泥土的芬芳,轻轻拂过脸颊,让人感到无比惬意。孩子们在田间小路上追逐嬉戏,老人们坐在门前晒太阳,享受着这份宁静与美好。",
    "speaker": "lingguang",
    "response_format": "wav"
  }' --output ming_tts_online_long_lingguang.wav

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今日天气真系好靓,出去行下啦!",
    "instructions": "{\"方言\": \"粤语\", \"情感\": \"开心\"}",
    "response_format": "wav"
  }' --output output_cantonese.wav
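The curl requests above can also be issued from Python. The sketch below is a hypothetical helper (not part of this PR) that assembles the same JSON body used by the `/v1/audio/speech` requests; the field names (`input`, `speaker`, `instructions`, `response_format`) follow the examples above, and `instructions` is serialized as a JSON string as in the Cantonese example.

```python
import json

def build_speech_request(text, model=None, speaker=None, instructions=None,
                         response_format="wav"):
    """Assemble the JSON body for POST /v1/audio/speech (hypothetical helper)."""
    body = {"input": text, "response_format": response_format}
    if model is not None:
        body["model"] = model
    if speaker is not None:
        body["speaker"] = speaker
    if instructions is not None:
        # Dialect/emotion hints are passed as a JSON string,
        # mirroring the Cantonese curl example above.
        body["instructions"] = json.dumps(instructions, ensure_ascii=False)
    return body

body = build_speech_request(
    "今日天气真系好靓,出去行下啦!",
    instructions={"方言": "粤语", "情感": "开心"},
)
print(json.dumps(body, ensure_ascii=False))
```

Sending it would then be, e.g., `requests.post(f"http://localhost:{PORT}/v1/audio/speech", json=body)` and writing `resp.content` to a `.wav` file.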

Omni-Speech (thinker + talker)

# offline example
python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_image \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
    --modalities text,audio

python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_video \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
    --modalities text,audio

python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_audio \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
    --modalities text,audio

# online speech serving
MODEL="/workspace/models/Ming-flash-omni-2.0"
PORT="8091"
STAGE_CONFIG="vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml"
vllm serve "$MODEL" \
    --stage-configs-path "$STAGE_CONFIG" \
    --port "$PORT" \
    --omni \
    --log-stats

# send request and extract audio output
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "/workspace/models/Ming-flash-omni-2.0",
  "modalities": ["audio"],
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "你是一个友好的AI助手。\n\ndetailed thinking off"
        }
      ]
    },
    {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
  ]
}' | jq -r '.choices[0].message.audio.data' | base64 -d > output_omni_parrot.wav
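For clients without `jq`, the `jq ... | base64 -d` step can be done in Python. This is a sketch, assuming the response shape shown in the command above (base64-encoded WAV at `choices[0].message.audio.data`); the fake response used here is for illustration only.

```python
import base64

def extract_audio(response_json):
    """Decode the base64 audio payload from a chat-completion response."""
    data = response_json["choices"][0]["message"]["audio"]["data"]
    return base64.b64decode(data)

# Minimal fake response for illustration; a real one comes from the server.
fake = {
    "choices": [
        {"message": {"audio": {"data": base64.b64encode(b"RIFF....WAVE").decode()}}}
    ]
}
wav_bytes = extract_audio(fake)
# with open("output_omni_parrot.wav", "wb") as f:
#     f.write(wav_bytes)
```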

Please see my subsequent comments

Test Result

#2890 (comment)
#2890 (comment)



@hsliuustc0106 (Collaborator) commented:
This PR is marked as [Do-Not-Merge]. Ready for full review when the Do-Not-Merge label is removed.

Note: When ready for review, this PR will need:

  1. Pre-commit and docs checks to pass
  2. L3 tests run locally (since this adds ~3300 LOC for a new model)
  3. Complete test plan and results in PR description

@yuanheng-zhao yuanheng-zhao changed the title [Do-Not-Merge][Model] Ming-flash-omni-2.0 Omni-Speech and TTS [Model] Ming-flash-omni-2.0 Omni-Speech and TTS Apr 19, 2026
@yuanheng-zhao (Author) commented:

Verifications in Recipe

Omni serving (Thinker + Talker)

Curl with image input:

curl http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": [
          {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"}},
          {"type": "text", "text": "Describe this image in detail."}
        ]}
      ],
      "modalities": ["text"]
    }'

Output:

{"id":"chatcmpl-b1bd1f1a80dcb93e","object":"chat.completion","created":1776693638,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"message":{"role":"assistant","content":"The image captures a striking view of the Tokyo Skytree, framed by the delicate pink blossoms of cherry trees in full bloom. The iconic tower stands tall against a clear blue sky, with its intricate lattice structure and modern design visible through the branches. The vibrant contrast between the soft pink flowers and the bright blue backdrop creates a picturesque scene that highlights both natural beauty and architectural marvel.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":2076,"total_tokens":2153,"completion_tokens":77,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":null}

Curl with modalities audio and save wav file:

curl http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
      ],
      "modalities": ["audio"]
    }' | jq -r '.choices[0].message.audio.data' | base64 -d > ming_omni_parrot.wav

Output:

ming_omni_parrot.wav

Curl with audio input (ASR) and output both text and audio:

curl http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": [
          {"type": "audio_url", "audio_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"}},
          {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."}
        ]}
      ],
      "modalities": ["text", "audio"]
    }' | jq -r '.choices[0].message.content'

Output:

English the first words i spoke in the original phonograph a little piece of practical poetry mary had a little lamb its fleece was white as snow and everywhere that mary went the lamb was sure to go

Curl with streaming mode:

curl -N http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
      ],
      "modalities": ["text"],
      "stream": true
    }'

Output (partial):

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":",以确保"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"它们的"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"健康和"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"幸福"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"。"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: [DONE]
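The streamed chunks above follow the server-sent-events convention: each line is `data: <json>` and the stream ends with `data: [DONE]`. As a sketch (assuming that chunk shape; the sample lines below are fabricated for illustration), a client can accumulate the text deltas like this:

```python
import json

def collect_stream_text(lines):
    """Concatenate delta.content from SSE chat-completion chunks."""
    pieces = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        # The final chunk may carry an empty content string.
        pieces.append(delta.get("content") or "")
    return "".join(pieces)

# Two chunks in the shape of the partial output above (fabricated content).
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"它们的"}}],"modality":"text"}',
    'data: {"choices":[{"index":0,"delta":{"content":"健康"}}],"modality":"text"}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # 它们的健康
```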

@hsliuustc0106 (Collaborator) left a comment:
can we change to the cli overrides and rm the yamls after #2383

@yuanheng-zhao (Author) commented:

Verifications in Recipe

TTS (standalone talker)

Curl cmd:

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "input": "我会一直在这里陪着你。",
      "response_format": "wav"
    }' --output ming_online.wav

Output:

ming_online.wav

Curl with speaker lingguang:

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "input": "春天来了,万物复苏,大地一片生机盎然。田野里的油菜花开得金灿灿的,蜜蜂在花丛中忙碌地采蜜。远处的山坡上,桃花和杏花竞相绽放,粉的白的交织在一起,美不胜收。清晨的微风带着泥土的芬芳,轻轻拂过脸颊,让人感到无比惬意。孩子们在田间小路上追逐嬉戏,老人们坐在门前晒太阳,享受着这份宁静与美好。",
      "speaker": "lingguang",
      "response_format": "wav"
    }' --output ming_online_lingguang.wav

Output:

ming_online_lingguang.wav

@yuanheng-zhao (Author) commented:

can we change to the cli overrides and rm the yamls after #2383

@hsliuustc0106 As the model features are ready to use, shall we merge talker support (omni-speech, TTS; this PR) and image generation (#2875) first? Converting to the new config system requires adapting and re-testing all test cases on both PRs, which might delay these features.

I could update the config system for all Ming-related features in a subsequent PR:

  • ASR (thinker only)
  • TTS (talker only)
  • omni serving (multi modality inputs, omni-speech)
  • image-gen (multi modality inputs, image output)

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 21, 2026
@hsliuustc0106 (Collaborator) left a comment:

do we update the docs?

@hsliuustc0106 hsliuustc0106 added the omni-test label to trigger buildkite omni model test in nightly CI label Apr 21, 2026
@yuanheng-zhao (Author) commented:

do we update the docs?

I updated docs in these commits: https://github.com/vllm-project/vllm-omni/pull/2890/changes/faea882515db83b08fed34b2b63cbb3dd79dc636..0b1c1059106f67baeaee6790e1f891ac82fce3bf

Basically, I trimmed the examples and moved the most representative cases into the corresponding recipes. Any suggestions for further trimming the example directories?

@amy-why-3459 (Contributor) commented:

LGTM

@yuanheng-zhao (Author) commented:

cc @hsliuustc0106

@hsliuustc0106 hsliuustc0106 merged commit 3a32cd6 into vllm-project:main Apr 23, 2026
8 checks passed
