[Model] Add Ming-flash-omni-2.0 Thinker Stage #1822

Merged: hsliuustc0106 merged 72 commits into vllm-project:main from yuanheng-zhao:model/ming-omni on Apr 17, 2026

@yuanheng-zhao
Collaborator

@yuanheng-zhao yuanheng-zhao commented Mar 11, 2026

Purpose

This PR adds support for the Thinker stage of inclusionAI/Ming-flash-omni-2.0 (https://huggingface.co/inclusionAI/Ming-flash-omni-2.0), which accepts audio, image, video, and text inputs and generates text.

Supported features:

  • Thinker LLM (BailingMoeV2Model) Tensor Parallelism
  • Thinker LLM Pipeline Parallelism
  • torch.compile (this introduces precision differences among eager mode, vllm==0.18, and vllm==0.19)

Modified HF model repo to use: https://huggingface.co/Jonathan1909/Ming-flash-omni-2.0

Model support doc:
https://docs.google.com/document/d/16nTYLjXRFJ7ztk_phV7zZNChWawzdHacA327226QQ7o/edit?usp=sharing

Relates to #1343

Test Plan

  • Offline e2e example tests
    • Text-only, audio-only, image-only, and video-only inputs
    • Mixed-modality inputs
    • Thinking mode (reasoning)
  • Online e2e example tests
  • Model unit test

Test Result

For detailed test results and comparisons, please refer to the subsequent comments:
#1822 (comment)
#1822 (comment)


@hsliuustc0106
Collaborator

huge work from yuanheng :)

Collaborator

It's better to add an example for this model

Collaborator Author

Sure. I'm going to add an e2e example as well as tests, and rebase and clean up the code this week.

@congw729
Collaborator

What’s the minimum VRAM needed to run this model? In other words, what is the lowest-end GPU that can run it?

@yuanheng-zhao
Collaborator Author

> What’s the minimum VRAM needed to run this model? In other words, what is the lowest-end GPU that can run it?

@congw729 The main LLM occupies around 200 GB of VRAM (bf16); I'm currently running it with TP=4 on 4 NVIDIA H100 80 GB GPUs. With the image-gen and talker stages included, the whole pipeline's parameters occupy around 220-240 GB (not counting the KV cache), depending on which DiT model is used.
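
As a rough back-of-the-envelope check of the figures in this reply, the sketch below divides the quoted Thinker weight footprint across the tensor-parallel ranks. It assumes the weights are sharded perfectly evenly and ignores activations, the KV cache, and framework overhead, so treat the numbers as indicative only. The function name is hypothetical and not part of this PR.

```python
def per_gpu_weight_gb(total_weight_gb: float, tp_size: int) -> float:
    """Weight memory per GPU, ASSUMING tensor parallelism shards weights evenly."""
    if tp_size <= 0:
        raise ValueError("tp_size must be positive")
    return total_weight_gb / tp_size


GPU_VRAM_GB = 80         # H100 80 GB, as quoted in the reply
TP_SIZE = 4              # TP 4, as quoted in the reply
THINKER_WEIGHT_GB = 200  # approximate bf16 footprint quoted in the reply

thinker_per_gpu = per_gpu_weight_gb(THINKER_WEIGHT_GB, TP_SIZE)
headroom_per_gpu = GPU_VRAM_GB - thinker_per_gpu

print(f"Thinker weights per GPU: ~{thinker_per_gpu:.0f} GB")
print(f"Left for KV cache, activations, overhead: ~{headroom_per_gpu:.0f} GB")
```

Under these assumptions each GPU holds about 50 GB of weights, leaving roughly 30 GB per card for everything else.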

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

### Image understanding
```bash
python end2end.py --query-type use_image
```

Contributor

Is it possible to add an online example as well?

Collaborator Author

Sure, let me add one and test it.

Collaborator Author

Added and tested. Partial logs pasted in #1822 (comment)


# Ported from vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/vq/whisper_encoder.py
# TODO: we might want to extract util functions in future
def sinusoids(length, channels, max_timescale=10000):
Contributor

Can we directly import it from qwen3-tts to avoid redefining it? Or can we extract this function as a public function?

Collaborator Author

Extracted to vllm_omni/model_executor/models/whisper_utils.py for now
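
For readers unfamiliar with the helper under discussion, here is a minimal sketch of a Whisper-style sinusoidal position table with the same signature as the `sinusoids` function quoted in this thread. Only the signature appears in the thread; the body below is an assumption modeled on the well-known Whisper formulation (sine half followed by cosine half, geometrically spaced timescales) and is written with the standard library only, so it is not the code in `whisper_utils.py`.

```python
import math


def sinusoids(length, channels, max_timescale=10000):
    """Return a length x channels table of sinusoidal position embeddings.

    ASSUMED formulation (Whisper-style): the first half of each row holds
    sines, the second half cosines, over geometrically spaced timescales.
    """
    if channels % 2 != 0 or channels < 4:
        raise ValueError("channels must be an even number >= 4")
    half = channels // 2
    log_timescale_increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_timescale_increment * i) for i in range(half)]

    table = []
    for position in range(length):
        scaled = [position * inv for inv in inv_timescales]
        table.append([math.sin(s) for s in scaled] + [math.cos(s) for s in scaled])
    return table


table = sinusoids(length=3, channels=4)
print(len(table), len(table[0]))
```

A real implementation would typically return a tensor rather than nested lists; the list form is used here only to keep the sketch dependency-free.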

@yuanheng-zhao
Collaborator Author

Online Serving

Recorded in examples/online_serving/ming_flash_omni/README.md

Launch the server

vllm serve Jonathan1909/Ming-flash-omni-2.0 --omni --port 8091 --stage-init-timeout 2000 --init-timeout 2000

Note: if loading the model from your storage is slow, increase --stage-init-timeout and --init-timeout.

Send request via curl

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
    ],
    "modalities": ["text"]
  }'

# Output
{"id":"chatcmpl-a5d3e30b26fa734e","object":"chat.completion","created":1775628591,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"message":{"role":"assistant","content":"鹦鹉是一类非常有趣且多样化的鸟类,它们的生活习性因种类不同而有所差异,但总体上可以归纳为以下几个方面:\n\n### 1. 社交行为\n鹦鹉是高度社交的鸟类,通常生活在群体中。它们通过复杂的叫声、肢体语言和面部表情进行交流。许多鹦鹉种类会形成长期的伴侣关系,甚至有些种类会终生配对。\n\n### 2. 智力与学习能力\n鹦鹉被认为是世界上最聪明的鸟类之一。它们具有高度的认知能力,能够模仿人类语言和其他声音,甚至理解一些简单的指令和词汇。某些鹦鹉种类,如非洲灰鹦鹉和亚马逊鹦鹉,展示出惊人的记忆力和解决问题的能力。\n\n### 3. 饮食\n鹦鹉的饮食主要包括种子、坚果、水果、花蜜和昆虫。不同种类的鹦鹉有不同的饮食偏好。例如,亚马逊鹦鹉喜欢吃水果,而玄凤鹦鹉则更喜欢种子和坚果。\n\n### 4. 繁殖\n鹦鹉通常在树洞或岩石缝隙中筑巢。雌鸟通常会产下2-4个卵,孵化期因种类不同而异,一般在20-30天之间。幼鸟在孵化后需要父母长时间的照顾,直到它们能够独立觅食。\n\n### 5. 飞行\n大多数鹦鹉是强壮的飞手,能够进行长距离的迁徙。它们的翅膀强健,飞行速度快,且具有很好的灵活性。\n\n### 6. 栖息地\n鹦鹉主要分布在热带和亚热带地区,尤其是南美洲、非洲、亚洲和澳大利亚。不同的种类适应不同的环境,从雨林到草原,再到沙漠边缘。\n\n### 7. 保护状态\n由于栖息地破坏、非法捕捉和贸易,许多鹦鹉种类面临生存威胁。国际自然保护联盟(IUCN)列出了许多鹦鹉物种为濒危或易危。保护措施包括建立自然保护区、立法禁止非法捕捉和贸易,以及人工繁殖计划。\n\n### 8. 特殊行为\n一些鹦鹉种类展示出独特的社会行为,如“梳理”羽毛、互相喂食和玩耍。这些行为不仅有助于维持群体关系,还能提高个体的健康和幸福感。\n\n### 9. 适应性\n鹦鹉具有很强的适应能力,能够在多种环境中生存。它们不仅能适应不同的气候条件,还能学会利用人类提供的食物来源,如农田和城市公园。\n\n总的来说,鹦鹉是非常复杂和有趣的鸟类,它们的生活习性和行为模式为我们提供了丰富的研究对象和观赏乐趣。保护这些美丽的生物是我们共同的责任。","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":38,"total_tokens":518,"completion_tokens":480,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":null}
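
The same non-streaming request can be sent from Python. The sketch below uses only the standard library and mirrors the curl payload above; the helper names `build_chat_payload` and `post_chat_completion` are hypothetical, and the network call is left commented out because it assumes the server from "Launch the server" is running on localhost:8091.

```python
import json
import urllib.request


def build_chat_payload(model, system_text, user_text, stream=False):
    """Build a chat-completions payload shaped like the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_text}]},
            {"role": "user", "content": user_text},
        ],
        "modalities": ["text"],
    }
    if stream:
        payload["stream"] = True
    return payload


def post_chat_completion(payload, base_url="http://localhost:8091"):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "Jonathan1909/Ming-flash-omni-2.0",
    "你是一个友好的AI助手。\n\ndetailed thinking off",
    "请详细介绍鹦鹉的生活习性。",
)

# Requires a running server; uncomment to send the request.
# reply = post_chat_completion(payload)
# print(reply["choices"][0]["message"]["content"])
```

The field path `choices[0].message.content` in the commented lines matches the response shown in the output above.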

Send request via curl (Streaming)

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
    ],
    "modalities": ["text"],
    "stream": true
  }'

# partial output
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null,"modality":"text"}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"鹦鹉"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"是一"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"非常"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}

data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"有趣"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
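
To turn chunks like these back into text, a client can concatenate each chunk's `delta.content`. The sketch below is a minimal, standard-library-only parser; `collect_stream_text` is a hypothetical helper, the sample lines are trimmed versions of the chunks above, and the `[DONE]` terminator is an assumption based on the usual OpenAI-style stream format (it does not appear in the partial output).

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from server-sent 'data:' lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # ASSUMED end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"鹦鹉"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"是一"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":""}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"非常"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"有趣"}}]}',
    "data: [DONE]",
]

text = collect_stream_text(sample_lines)
print(text)
```

Chunks with empty content, like the fourth one in the partial output, are simply skipped by this helper.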

Send request via helper python script

python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Jonathan1909/Ming-flash-omni-2.0 --query-type use_mixed_modalities --port 8091 --host "localhost" --modalities text
Chat completion output from text: The audio clip features a man reciting the nursery rhyme "Mary Had a Little Lamb." The image shows a baby wearing glasses and holding a book, with the Tokyo Skytree visible through cherry blossoms in the background. The video is funny because it humorously portrays a baby mimicking an adult's behavior of reading a book, complete with exaggerated gestures and expressions, while wearing oversized glasses that add to the comedic effect.

Comment thread vllm_omni/model_executor/stage_configs/bailingmm_moe_v2_lite.yaml Outdated
@pytest.mark.omni
@hardware_test(res={"cuda": "H100"}, num_cards=4)
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
def test_image_to_text_001(omni_server, openai_client) -> None:
Contributor

@yenuo26 PTAL. Please review whether the test cases are reasonable. Is it possible to increase concurrency for image or video inputs?

@hsliuustc0106 added the "ready" label (label to trigger buildkite CI) on Apr 17, 2026
@hsliuustc0106 hsliuustc0106 merged commit c0ccbb8 into vllm-project:main Apr 17, 2026
7 of 8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
ZhengWG added a commit to ZhengWG/vllm-omni that referenced this pull request Apr 28, 2026
Merges main (which now contains Ming-flash-omni-2.0 thinker (vllm-project#1822) and
talker/TTS (vllm-project#2890) stages) into the image-generation feature branch.

Signed-off-by: ZhengWG <zwg0606@gmail.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
7 participants