[Model] Add Ming-flash-omni-2.0 Thinker Stage by yuanheng-zhao · Pull Request #1822 · vllm-project/vllm-omni

yuanheng-zhao · 2026-03-11T15:40:23Z


Purpose

This PR adds support for the Thinker stage of inclusionAI/Ming-flash-omni-2.0 (https://huggingface.co/inclusionAI/Ming-flash-omni-2.0), which accepts audio, image, video, and text inputs and generates text.

Supported features:

- Tensor parallelism for the Thinker LLM (BailingMoeV2Model)
- Pipeline parallelism for the Thinker LLM
- torch.compile (note: this introduces precision differences among eager mode, vllm==0.18, and vllm==0.19)

Modified HF model repo to use: https://huggingface.co/Jonathan1909/Ming-flash-omni-2.0

Model support doc:
https://docs.google.com/document/d/16nTYLjXRFJ7ztk_phV7zZNChWawzdHacA327226QQ7o/edit?usp=sharing

Relates to #1343

Test Plan

Offline e2e example tests
- Text-only, audio-only, image-only, and video-only inputs
- Mixed-modality inputs
- Thinking mode (reasoning)
Online e2e example tests
Model unit test

Test Result

For detailed test results and comparisons, please refer to the subsequent comments:
#1822 (comment)
#1822 (comment)


hsliuustc0106 · 2026-03-11T17:11:16Z

huge work from yuanheng :)

SamitHuang · 2026-03-25T04:22:45Z

It's better to add an example for this model

Sure. I'm going to add an e2e example as well as tests, then rebase and clean up the code this week.

congw729 · 2026-03-25T06:32:14Z

What’s the minimum VRAM needed to run this model? / the smallest-end GPU that can run this model?

yuanheng-zhao · 2026-03-25T06:50:29Z

> What’s the minimum VRAM needed to run this model? / the smallest-end GPU that can run this model?

@congw729 The main LLM occupies around 200 GB of VRAM (bf16); I'm currently running TP=4 on 4 NVIDIA H100 80 GB GPUs. Once the image-gen and talker stages are incorporated, the parameters of the whole pipeline occupy around 220-240 GB (not counting the KV cache), depending on which DiT model is used.
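
As a rough back-of-the-envelope check on the figures above, the sketch below divides the quoted weight footprint across the tensor-parallel ranks. It assumes the weights shard evenly across GPUs, which is a simplification (encoders and replicated layers may not shard), and that the full pipeline would share the same four GPUs, which the comment does not state. The function name `per_gpu_headroom_gb` is hypothetical.

```python
def per_gpu_headroom_gb(weights_gb: float, tp_size: int, gpu_vram_gb: float) -> tuple[float, float]:
    """Return (weights per GPU, VRAM left per GPU), assuming an even split."""
    weights_per_gpu = weights_gb / tp_size
    return weights_per_gpu, gpu_vram_gb - weights_per_gpu


# Figures quoted in the comment: ~200 GB of bf16 weights, TP=4, 80 GB H100s.
thinker_share, thinker_headroom = per_gpu_headroom_gb(200, 4, 80)
# Upper end of the quoted full-pipeline range (assumed to sit on the same 4 GPUs).
pipeline_share, pipeline_headroom = per_gpu_headroom_gb(240, 4, 80)

print(f"thinker only:  {thinker_share:.0f} GB weights/GPU, {thinker_headroom:.0f} GB left")
print(f"full pipeline: {pipeline_share:.0f} GB weights/GPU, {pipeline_headroom:.0f} GB left")
```

Whatever is left per GPU has to cover the KV cache, activations, and runtime overhead, so these numbers are an upper bound on usable headroom rather than a sizing guarantee.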

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>


amy-why-3459 · 2026-04-07T06:38:25Z

+
+### Image understanding
+```bash
+python end2end.py --query-type use_image


Is it possible to add an online example as well?

Sure, let me add and test

Added and tested. Partial logs pasted in #1822 (comment)

amy-why-3459 · 2026-04-07T06:58:19Z

+
+# Ported from vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/vq/whisper_encoder.py
+# TODO: we might want to extract util functions in future
+def sinusoids(length, channels, max_timescale=10000):


Can we directly import it from qwen3-tts to avoid redefining it? Or can we extract this function as a public function?

Extracted to vllm_omni/model_executor/models/whisper_utils.py for now
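
The diff excerpt above shows only the signature. For context, a Whisper-style `sinusoids` helper builds a fixed table of sinusoidal positional embeddings; the pure-Python sketch below illustrates that computation. It is not the code in `whisper_utils.py` (which presumably operates on tensors), and the name `sinusoids_reference` is hypothetical.

```python
import math


def sinusoids_reference(length, channels, max_timescale=10000):
    """Return a length x channels table: sines in the first half, cosines in the second."""
    if channels % 2 != 0 or channels < 4:
        raise ValueError("channels must be even and at least 4")
    half = channels // 2
    # Inverse timescales are geometrically spaced from 1 down to 1 / max_timescale.
    log_increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_increment * i) for i in range(half)]
    table = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        table.append([math.sin(s) for s in scaled] + [math.cos(s) for s in scaled])
    return table


table = sinusoids_reference(length=3, channels=4)
print(table[0])  # position 0: sin(0) = 0 in the first half, cos(0) = 1 in the second
```

Keeping one shared helper, as the reply describes, avoids two model packages carrying separate copies of the same table construction.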


yuanheng-zhao · 2026-04-08T06:22:01Z

Online Serving

Recorded in examples/online_serving/ming_flash_omni/README.md

Launch the server

vllm serve Jonathan1909/Ming-flash-omni-2.0 --omni --port 8091 --stage-init-timeout 2000 --init-timeout 2000

*If the model loads slowly from your storage, increase --stage-init-timeout and --init-timeout.

Send request via curl

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
    ],
    "modalities": ["text"]
  }'

# Output
{"id":"chatcmpl-a5d3e30b26fa734e","object":"chat.completion","created":1775628591,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"message":{"role":"assistant","content":"鹦鹉是一类非常有趣且多样化的鸟类，它们的生活习性因种类不同而有所差异，但总体上可以归纳为以下几个方面：\n\n### 1. 社交行为\n鹦鹉是高度社交的鸟类，通常生活在群体中。它们通过复杂的叫声、肢体语言和面部表情进行交流。许多鹦鹉种类会形成长期的伴侣关系，甚至有些种类会终生配对。\n\n### 2. 智力与学习能力\n鹦鹉被认为是世界上最聪明的鸟类之一。它们具有高度的认知能力，能够模仿人类语言和其他声音，甚至理解一些简单的指令和词汇。某些鹦鹉种类，如非洲灰鹦鹉和亚马逊鹦鹉，展示出惊人的记忆力和解决问题的能力。\n\n### 3. 饮食\n鹦鹉的饮食主要包括种子、坚果、水果、花蜜和昆虫。不同种类的鹦鹉有不同的饮食偏好。例如，亚马逊鹦鹉喜欢吃水果，而玄凤鹦鹉则更喜欢种子和坚果。\n\n### 4. 繁殖\n鹦鹉通常在树洞或岩石缝隙中筑巢。雌鸟通常会产下2-4个卵，孵化期因种类不同而异，一般在20-30天之间。幼鸟在孵化后需要父母长时间的照顾，直到它们能够独立觅食。\n\n### 5. 飞行\n大多数鹦鹉是强壮的飞手，能够进行长距离的迁徙。它们的翅膀强健，飞行速度快，且具有很好的灵活性。\n\n### 6. 栖息地\n鹦鹉主要分布在热带和亚热带地区，尤其是南美洲、非洲、亚洲和澳大利亚。不同的种类适应不同的环境，从雨林到草原，再到沙漠边缘。\n\n### 7. 保护状态\n由于栖息地破坏、非法捕捉和贸易，许多鹦鹉种类面临生存威胁。国际自然保护联盟（IUCN）列出了许多鹦鹉物种为濒危或易危。保护措施包括建立自然保护区、立法禁止非法捕捉和贸易，以及人工繁殖计划。\n\n### 8. 特殊行为\n一些鹦鹉种类展示出独特的社会行为，如“梳理”羽毛、互相喂食和玩耍。这些行为不仅有助于维持群体关系，还能提高个体的健康和幸福感。\n\n### 9. 适应性\n鹦鹉具有很强的适应能力，能够在多种环境中生存。它们不仅能适应不同的气候条件，还能学会利用人类提供的食物来源，如农田和城市公园。\n\n总的来说，鹦鹉是非常复杂和有趣的鸟类，它们的生活习性和行为模式为我们提供了丰富的研究对象和观赏乐趣。保护这些美丽的生物是我们共同的责任。","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":38,"total_tokens":518,"completion_tokens":480,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":null}
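
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the body of the curl call above; the helper names `build_chat_payload` and `post_chat` are hypothetical, and the English prompt strings are stand-ins for the Chinese prompts in the original request.

```python
import json
import urllib.request


def build_chat_payload(model, user_text, system_text=None, stream=False):
    """Build a chat-completions body shaped like the curl example."""
    messages = []
    if system_text is not None:
        messages.append({"role": "system", "content": [{"type": "text", "text": system_text}]})
    messages.append({"role": "user", "content": user_text})
    payload = {"model": model, "messages": messages, "modalities": ["text"]}
    if stream:
        payload["stream"] = True
    return payload


def post_chat(payload, base_url="http://localhost:8091"):
    """POST the payload and return the decoded JSON response (non-streaming only)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "Jonathan1909/Ming-flash-omni-2.0",
    user_text="Describe the living habits of parrots in detail.",
    system_text="You are a friendly AI assistant.\n\ndetailed thinking off",
)
# With the server from "Launch the server" running:
#   reply = post_chat(payload)
#   print(reply["choices"][0]["message"]["content"])
```

The network call is left commented out so the snippet can be imported or run without a live server.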

Send request via curl (Streaming)

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
    ],
    "modalities": ["text"],
    "stream": true
  }'

# partial output
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null,"modality":"text"}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"鹦鹉"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"是一"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"类"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"非常"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"有趣"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
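
Each `data:` line in the streamed output carries one JSON chunk whose `delta.content` holds the next text fragment. A minimal sketch of reassembling the text follows; `collect_stream_text` is a hypothetical helper, and the `[DONE]` sentinel is an assumption based on the usual OpenAI-style streaming convention, since the partial log does not show the end of the stream.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from 'data: {...}' server-sent event lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Trimmed-down chunks modelled on the first lines of the partial output.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"鹦鹉"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"是一"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # 鹦鹉是一
```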

Send request via helper python script

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Jonathan1909/Ming-flash-omni-2.0 --query-type use_mixed_modalities --port 8091 --host "localhost" --modalities text
Chat completion output from text: The audio clip features a man reciting the nursery rhyme "Mary Had a Little Lamb." The image shows a baby wearing glasses and holding a book, with the Tokyo Skytree visible through cherry blossoms in the background. The video is funny because it humorously portrays a baby mimicking an adult's behavior of reading a book, complete with exaggerated gestures and expressions, while wearing oversized glasses that add to the comedic effect.


amy-why-3459 · 2026-04-09T13:53:13Z

+@pytest.mark.omni
+@hardware_test(res={"cuda": "H100"}, num_cards=4)
+@pytest.mark.parametrize("omni_server", test_params, indirect=True)
+def test_image_to_text_001(omni_server, openai_client) -> None:


@yenuo26 PTAL. Please review whether the test cases are reasonable. Is it possible to increase concurrency for image or video input?


Merges main (which now contains Ming-flash-omni-2.0 thinker (vllm-project#1822) and talker/TTS (vllm-project#2890) stages) into the image-generation feature branch. Signed-off-by: ZhengWG <zwg0606@gmail.com>


SamitHuang reviewed Mar 25, 2026


yuanheng-zhao added 26 commits March 26, 2026 17:13

- c126213: add configs and stage yaml
- b672354: draft: add BailingMoeV2Model
- 5d2e529: upd
- 301a3d5: draft: add thinker stage audio, vision encoders, connectors
- 69332ae: upd audio, vision encoders, connectors
- de162a5: draft: add thinker and omni gen cls
- 45608c2: draft: add processor
- aa03a2f: draft: upd processor
- f27e94e: draft: upd processor
- d966861: upd and fix
- 6901d8e: add to model registry
- 09d9cfe: hack to fix file not in hf repo
- 7a85029: fix name word_embeddings
- 5c3273b: fix and register temp processor
- daf6798: fix thinker stage weight loading
- 3b5b0be: adapt to vllm layer Attention
- 39aa66f: refine the temp hack in arg_util
- d57c1ae: make ming thinker dummy inputs builder ret all modalities
- 06cb1fb: upd
- 67957d3: use vllm FusedMoE
- 264dbb3: Adapt ming configs to transformer_utils configs
- 84ea1e5: clean ming configs
- b85aa2a: trivial revert
- da03cf2: trivial upd
- 16efd7f: register omni custom configs to vllm configs registry
- f8cc6b7: register tokenizer for custom config

amy-why-3459 reviewed Apr 7, 2026


yuanheng-zhao added 5 commits April 7, 2026 12:16

- 3cfe105: extract whisper utils
- 0cc613c: trivial: add license header to test
- 6cb13e5: fix ming processor modality token counts
- 32a775c: upd cu_seqlens as required
- fe7ad5c: Add online serving example and doc (thinker)
- b0f91fa: trivial upd readme

yuanheng-zhao requested a review from amy-why-3459 April 9, 2026 12:57

amy-why-3459 reviewed Apr 9, 2026


Comment thread vllm_omni/model_executor/stage_configs/bailingmm_moe_v2_lite.yaml Outdated

yuanheng-zhao added 2 commits April 9, 2026 21:48

- a4c5ed2: trivial upd
- aeae057: Merge from main

amy-why-3459 reviewed Apr 9, 2026


yuanheng-zhao added 4 commits April 13, 2026 14:16

- d0f7b96: Merge from main
- 300b56c: cleanup unused config cls
- cfd949c: enforce all sub processors (audio,image/video towers) exist
- 121dd76: trivial cleanup

ZhengWG mentioned this pull request Apr 17, 2026

[WIP][Model] Add Ming-flash-omni-2.0 Image Generation (Diffusion) Stage #2875

Open

5 tasks

Merge from main

d11cf28


hsliuustc0106 added the ready label (label to trigger buildkite CI) Apr 17, 2026

hsliuustc0106 approved these changes Apr 17, 2026


hsliuustc0106 merged commit c0ccbb8 into vllm-project:main Apr 17, 2026
7 of 8 checks passed

lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026

[Model] Add Ming-flash-omni-2.0 Thinker Stage (vllm-project#1822)

3cba835


yuanheng-zhao mentioned this pull request Apr 20, 2026

[New Model]: inclusionAI/Ming-flash-omni-2.0 #1343

Open

1 task

NickCao mentioned this pull request Apr 21, 2026

[Bugfix] treewide: drop references to librosa #2996

Merged

5 tasks

ZhengWG mentioned this pull request Apr 26, 2026

[Model] Add Ming-flash-omni-2.0 Image Generation (Diffusion) Stage ZhengWG/vllm-omni#6

Draft

5 tasks

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Model] Add Ming-flash-omni-2.0 Thinker Stage (vllm-project#1822)

96b1115



7 participants
