
[Model] Ming-flash-omni-2.0 Omni-Speech and TTS#2890

Merged
hsliuustc0106 merged 43 commits into vllm-project:main from
yuanheng-zhao:model/ming-omni-talker-draft
Apr 23, 2026

Conversation

@yuanheng-zhao yuanheng-zhao commented Apr 17, 2026


Purpose

Adds Omni-Speech (thinker + talker stages) and TTS support for #1343.

cc @LHXuuu, @ZhengWG

(flow diagram attached to the PR)

Test Plan

TTS (standalone talker)

# offline examples
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case style
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case ip
python examples/offline_inference/ming_flash_omni_tts/end2end.py --case basic

# online serving TTS
MODEL="/workspace/models/Ming-flash-omni-2.0"
PORT="8091"
STAGE_CONFIG="vllm_omni/model_executor/stage_configs/ming_flash_omni_tts.yaml"
vllm serve "$MODEL" \
    --stage-configs-path "$STAGE_CONFIG" \
    --port "$PORT" \
    --omni \
    --log-stats

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/workspace/models/Ming-flash-omni-2.0",
        "input": "我会一直在这里陪着你,直到你慢慢、慢慢地沉入那个最温柔的梦里……好吗?",
        "response_format": "wav"
    }' --output ming_tts_online.wav

curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/models/Ming-flash-omni-2.0",
    "input": "春天来了,万物复苏,大地一片生机盎然。田野里的油菜花开得金灿灿的,蜜蜂在花丛中忙碌地采蜜。远处的山坡上,桃花和杏花竞相绽放,粉的白的交织在一起,美不胜收。清晨的微风带着泥土的芬芳,轻轻拂过脸颊,让人感到无比惬意。孩子们在田间小路上追逐嬉戏,老人们坐在门前晒太阳,享受着这份宁静与美好。",
    "speaker": "lingguang",
    "response_format": "wav"
  }' --output ming_tts_online_long_lingguang.wav

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今日天气真系好靓,出去行下啦!",
    "instructions": "{\"方言\": \"粤语\", \"情感\": \"开心\"}",
    "response_format": "wav"
  }' --output output_cantonese.wav
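The curl requests above can also be issued from Python. The sketch below is a hypothetical helper (not part of this PR) that assembles the same JSON body used by the `/v1/audio/speech` requests; the field names (`input`, `speaker`, `instructions`, `response_format`) follow the examples above, and `instructions` is serialized as a JSON string as in the Cantonese example.

```python
import json

def build_speech_request(text, model=None, speaker=None, instructions=None,
                         response_format="wav"):
    """Assemble the JSON body for POST /v1/audio/speech (hypothetical helper)."""
    body = {"input": text, "response_format": response_format}
    if model is not None:
        body["model"] = model
    if speaker is not None:
        body["speaker"] = speaker
    if instructions is not None:
        # Dialect/emotion hints are passed as a JSON string,
        # mirroring the Cantonese curl example above.
        body["instructions"] = json.dumps(instructions, ensure_ascii=False)
    return body

body = build_speech_request(
    "今日天气真系好靓,出去行下啦!",
    instructions={"方言": "粤语", "情感": "开心"},
)
print(json.dumps(body, ensure_ascii=False))
```

Sending it would then be, e.g., `requests.post(f"http://localhost:{PORT}/v1/audio/speech", json=body)` and writing `resp.content` to a `.wav` file.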

Omni-Speech (thinker + talker)

# offline example
python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_image \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
    --modalities text,audio

python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_video \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
    --modalities text,audio

python examples/offline_inference/ming_flash_omni/end2end.py \
    --query-type use_audio \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml \
    --modalities text,audio

# online speech serving
MODEL="/workspace/models/Ming-flash-omni-2.0"
PORT="8091"
STAGE_CONFIG="vllm_omni/model_executor/stage_configs/ming_flash_omni.yaml"
vllm serve "$MODEL" \
    --stage-configs-path "$STAGE_CONFIG" \
    --port "$PORT" \
    --omni \
    --log-stats

# send request and extract audio output
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "/workspace/models/Ming-flash-omni-2.0",
  "modalities": ["audio"],
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "你是一个友好的AI助手。\n\ndetailed thinking off"
        }
      ]
    },
    {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
  ]
}' | jq -r '.choices[0].message.audio.data' | base64 -d > output_omni_parrot.wav
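For clients without `jq`, the `jq ... | base64 -d` step can be done in Python. This is a sketch, assuming the response shape shown in the command above (base64-encoded WAV at `choices[0].message.audio.data`); the fake response used here is for illustration only.

```python
import base64

def extract_audio(response_json):
    """Decode the base64 audio payload from a chat-completion response."""
    data = response_json["choices"][0]["message"]["audio"]["data"]
    return base64.b64decode(data)

# Minimal fake response for illustration; a real one comes from the server.
fake = {
    "choices": [
        {"message": {"audio": {"data": base64.b64encode(b"RIFF....WAVE").decode()}}}
    ]
}
wav_bytes = extract_audio(fake)
# with open("output_omni_parrot.wav", "wb") as f:
#     f.write(wav_bytes)
```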

Please see my subsequent comments

Test Result

#2890 (comment)
#2890 (comment)



@hsliuustc0106 (Collaborator) commented:
This PR is marked as [Do-Not-Merge]. Ready for full review when the Do-Not-Merge label is removed.

Note: When ready for review, this PR will need:

  1. Pre-commit and docs checks to pass
  2. L3 tests run locally (since this adds ~3300 LOC for a new model)
  3. Complete test plan and results in PR description

@yuanheng-zhao yuanheng-zhao changed the title [Do-Not-Merge][Model] Ming-flash-omni-2.0 Omni-Speech and TTS [Model] Ming-flash-omni-2.0 Omni-Speech and TTS Apr 19, 2026
@yuanheng-zhao (Author) commented:

Verifications in Recipe

Omni serving (Thinker + Talker)

Curl with image input:

curl http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": [
          {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"}},
          {"type": "text", "text": "Describe this image in detail."}
        ]}
      ],
      "modalities": ["text"]
    }'

Output:

{"id":"chatcmpl-b1bd1f1a80dcb93e","object":"chat.completion","created":1776693638,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"message":{"role":"assistant","content":"The image captures a striking view of the Tokyo Skytree, framed by the delicate pink blossoms of cherry trees in full bloom. The iconic tower stands tall against a clear blue sky, with its intricate lattice structure and modern design visible through the branches. The vibrant contrast between the soft pink flowers and the bright blue backdrop creates a picturesque scene that highlights both natural beauty and architectural marvel.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":2076,"total_tokens":2153,"completion_tokens":77,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":null}

Curl with modalities audio and save wav file:

curl http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
      ],
      "modalities": ["audio"]
    }' | jq -r '.choices[0].message.audio.data' | base64 -d > ming_omni_parrot.wav

Output:

ming_omni_parrot.wav

Curl with audio input (ASR) and output both text and audio:

curl http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": [
          {"type": "audio_url", "audio_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"}},
          {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."}
        ]}
      ],
      "modalities": ["text", "audio"]
    }' | jq -r '.choices[0].message.content'

Output:

English the first words i spoke in the original phonograph a little piece of practical poetry mary had a little lamb its fleece was white as snow and everywhere that mary went the lamb was sure to go

Curl with streaming mode:

curl -N http://localhost:8091/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "messages": [
        {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
        {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
      ],
      "modalities": ["text"],
      "stream": true
    }'

Output (partial):

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":",以确保"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"它们的"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"健康和"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"幸福"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"。"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-8d5652aa82bc8a0f","object":"chat.completion.chunk","created":1776694471,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: [DONE]
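The streamed chunks above follow the server-sent-events convention: each line is `data: <json>` and the stream ends with `data: [DONE]`. As a sketch (assuming that chunk shape; the sample lines below are fabricated for illustration), a client can accumulate the text deltas like this:

```python
import json

def collect_stream_text(lines):
    """Concatenate delta.content from SSE chat-completion chunks."""
    pieces = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        # The final chunk may carry an empty content string.
        pieces.append(delta.get("content") or "")
    return "".join(pieces)

# Two chunks in the shape of the partial output above (fabricated content).
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"它们的"}}],"modality":"text"}',
    'data: {"choices":[{"index":0,"delta":{"content":"健康"}}],"modality":"text"}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # 它们的健康
```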

@hsliuustc0106 (Collaborator) left a comment:
can we change to the cli overrides and rm the yamls after #2383

@yuanheng-zhao (Author) commented:

Verifications in Recipe

TTS (standalone talker)

Curl cmd:

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "input": "我会一直在这里陪着你。",
      "response_format": "wav"
    }' --output ming_online.wav

Output:

ming_online.wav

Curl with speaker lingguang:

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Jonathan1909/Ming-flash-omni-2.0",
      "input": "春天来了,万物复苏,大地一片生机盎然。田野里的油菜花开得金灿灿的,蜜蜂在花丛中忙碌地采蜜。远处的山坡上,桃花和杏花竞相绽放,粉的白的交织在一起,美不胜收。清晨的微风带着泥土的芬芳,轻轻拂过脸颊,让人感到无比惬意。孩子们在田间小路上追逐嬉戏,老人们坐在门前晒太阳,享受着这份宁静与美好。",
      "speaker": "lingguang",
      "response_format": "wav"
    }' --output ming_online_lingguang.wav

Output:

ming_online_lingguang.wav

@yuanheng-zhao (Author) commented:

can we change to the cli overrides and rm the yamls after #2383

@hsliuustc0106 As the model features are ready to use, shall we merge talker support (omni-speech, TTS; this PR) and image generation (#2875) first? Converting to the new config system requires adapting and re-testing all test cases on both PRs, which might delay these features.

I could update the config system for all Ming-related features in a subsequent PR:

  • ASR (thinker only)
  • TTS (talker only)
  • omni serving (multi modality inputs, omni-speech)
  • image-gen (multi modality inputs, image output)

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 21, 2026
@hsliuustc0106 (Collaborator) left a comment:

do we update the docs?

@hsliuustc0106 hsliuustc0106 added the omni-test label to trigger buildkite omni model test in nightly CI label Apr 21, 2026
@yuanheng-zhao (Author) commented:

do we update the docs?

I updated docs in these commits: https://github.com/vllm-project/vllm-omni/pull/2890/changes/faea882515db83b08fed34b2b63cbb3dd79dc636..0b1c1059106f67baeaee6790e1f891ac82fce3bf

Basically, I trimmed the examples and moved the most representative cases into the corresponding recipes. Any suggestions for further trimming the example directories?

@amy-why-3459 (Contributor) commented:

LGTM

@yuanheng-zhao (Author) commented:

cc @hsliuustc0106

@hsliuustc0106 hsliuustc0106 merged commit 3a32cd6 into vllm-project:main Apr 23, 2026
8 checks passed
