[Model] Add Ming-flash-omni-2.0 Thinker Stage #1822
Huge work from yuanheng :)
It's better to add an example for this model
Sure. I'm going to add an e2e example as well as tests, and rebase and clean up the code this week.
What's the minimum VRAM needed to run this model, i.e. the smallest GPU setup that can run it?
@congw729 The main LLM occupies around 200 GB of VRAM (bf16); currently I'm using TP 4 on 4 NVIDIA H100 80 GB GPUs. When the image-gen and talker stages are included, the whole pipeline's parameters occupy around 220-240 GB (KV cache not counted), depending on which DiT model is used.
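As a rough illustration of what these figures imply per GPU, here is a back-of-the-envelope sketch. It assumes the weights are split evenly across the tensor-parallel ranks, which is a simplification; the helper names are hypothetical and real usage also depends on KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope VRAM split under tensor parallelism.
# The 200 GB / TP 4 / 80 GB figures come from the comment above;
# the even split across ranks is an assumption made for illustration.

def per_gpu_weights_gb(total_weights_gb: float, tp_size: int) -> float:
    """Approximate weight memory per GPU, assuming an even split across TP ranks."""
    return total_weights_gb / tp_size


def headroom_gb(gpu_vram_gb: float, weights_per_gpu_gb: float) -> float:
    """VRAM left per GPU for KV cache, activations, and runtime overhead."""
    return gpu_vram_gb - weights_per_gpu_gb


thinker_per_gpu = per_gpu_weights_gb(200, tp_size=4)
thinker_headroom = headroom_gb(80, thinker_per_gpu)
print(thinker_per_gpu, thinker_headroom)  # 50.0 30.0
```

Under the same assumption, the 220-240 GB full pipeline would need roughly 55-60 GB of weights per GPU if it were spread evenly over four cards, though the extra stages may well be placed differently in practice.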
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
### Image understanding
```bash
python end2end.py --query-type use_image
```
Is it possible to add an online example as well?
Sure, let me add and test it.
Added and tested. Partial logs pasted in #1822 (comment)
# Ported from vllm_omni/model_executor/models/qwen3_tts/tokenizer_25hz/vq/whisper_encoder.py
# TODO: we might want to extract util functions in future
def sinusoids(length, channels, max_timescale=10000):
Can we directly import it from qwen3-tts to avoid redefining it? Or can we extract this function as a public function?
Extracted to vllm_omni/model_executor/models/whisper_utils.py for now
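For context, a helper with this signature typically computes sinusoidal position embeddings for a Whisper-style audio encoder. The sketch below is a dependency-free illustration of that formulation, modeled on the publicly known Whisper approach. It is not the code in whisper_utils.py, which presumably operates on tensors rather than Python lists.

```python
import math


def sinusoids(length, channels, max_timescale=10000):
    """Return a length x channels table of sinusoidal position embeddings.

    Illustrative pure-Python version: the first half of each row holds sines,
    the second half cosines, with geometrically spaced timescales.
    """
    if channels % 2 != 0 or channels < 4:
        raise ValueError("channels must be even and at least 4")
    half = channels // 2
    # Timescales run geometrically from 1 to max_timescale.
    log_increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_increment * i) for i in range(half)]
    table = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        table.append([math.sin(s) for s in scaled] + [math.cos(s) for s in scaled])
    return table


emb = sinusoids(length=3, channels=4)
print(emb[0])  # position 0: sines are 0.0, cosines are 1.0
```

Keeping one shared copy of such a function avoids the two model directories drifting apart, which is the concern raised in the review comment.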
Online Serving
Recorded in
Launch the server
vllm serve Jonathan1909/Ming-flash-omni-2.0 --omni --port 8091 --stage-init-timeout 2000 --init-timeout 2000
*If your storage is slow to load the model, increase
Send request via curl
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Jonathan1909/Ming-flash-omni-2.0",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
{"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
],
"modalities": ["text"]
}'
# Output
{"id":"chatcmpl-a5d3e30b26fa734e","object":"chat.completion","created":1775628591,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"message":{"role":"assistant","content":"鹦鹉是一类非常有趣且多样化的鸟类,它们的生活习性因种类不同而有所差异,但总体上可以归纳为以下几个方面:\n\n### 1. 社交行为\n鹦鹉是高度社交的鸟类,通常生活在群体中。它们通过复杂的叫声、肢体语言和面部表情进行交流。许多鹦鹉种类会形成长期的伴侣关系,甚至有些种类会终生配对。\n\n### 2. 智力与学习能力\n鹦鹉被认为是世界上最聪明的鸟类之一。它们具有高度的认知能力,能够模仿人类语言和其他声音,甚至理解一些简单的指令和词汇。某些鹦鹉种类,如非洲灰鹦鹉和亚马逊鹦鹉,展示出惊人的记忆力和解决问题的能力。\n\n### 3. 饮食\n鹦鹉的饮食主要包括种子、坚果、水果、花蜜和昆虫。不同种类的鹦鹉有不同的饮食偏好。例如,亚马逊鹦鹉喜欢吃水果,而玄凤鹦鹉则更喜欢种子和坚果。\n\n### 4. 繁殖\n鹦鹉通常在树洞或岩石缝隙中筑巢。雌鸟通常会产下2-4个卵,孵化期因种类不同而异,一般在20-30天之间。幼鸟在孵化后需要父母长时间的照顾,直到它们能够独立觅食。\n\n### 5. 飞行\n大多数鹦鹉是强壮的飞手,能够进行长距离的迁徙。它们的翅膀强健,飞行速度快,且具有很好的灵活性。\n\n### 6. 栖息地\n鹦鹉主要分布在热带和亚热带地区,尤其是南美洲、非洲、亚洲和澳大利亚。不同的种类适应不同的环境,从雨林到草原,再到沙漠边缘。\n\n### 7. 保护状态\n由于栖息地破坏、非法捕捉和贸易,许多鹦鹉种类面临生存威胁。国际自然保护联盟(IUCN)列出了许多鹦鹉物种为濒危或易危。保护措施包括建立自然保护区、立法禁止非法捕捉和贸易,以及人工繁殖计划。\n\n### 8. 特殊行为\n一些鹦鹉种类展示出独特的社会行为,如“梳理”羽毛、互相喂食和玩耍。这些行为不仅有助于维持群体关系,还能提高个体的健康和幸福感。\n\n### 9. 适应性\n鹦鹉具有很强的适应能力,能够在多种环境中生存。它们不仅能适应不同的气候条件,还能学会利用人类提供的食物来源,如农田和城市公园。\n\n总的来说,鹦鹉是非常复杂和有趣的鸟类,它们的生活习性和行为模式为我们提供了丰富的研究对象和观赏乐趣。保护这些美丽的生物是我们共同的责任。","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":38,"total_tokens":518,"completion_tokens":480,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":null}
Send request via curl (Streaming)
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Jonathan1909/Ming-flash-omni-2.0",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
{"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
],
"modalities": ["text"],
"stream": true
}'
# partial output
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null,"modality":"text"}
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"鹦鹉"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"是一"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"类"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"非常"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
data: {"id":"chatcmpl-ad347520e2634861","object":"chat.completion.chunk","created":1775629091,"model":"Jonathan1909/Ming-flash-omni-2.0","choices":[{"index":0,"delta":{"content":"有趣"},"logprobs":null,"finish_reason":null,"token_ids":null}],"modality":"text","metrics":{}}
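The streamed chunks above each carry a small piece of text in `choices[0].delta.content`. As a minimal sketch of how a client might reassemble them, the snippet below parses `data:` lines and joins the deltas. The function name is hypothetical, the sample lines are trimmed versions of the log above, and the trailing `data: [DONE]` sentinel is an assumption based on the usual OpenAI-compatible streaming format rather than something shown in this log.

```python
import json


def collect_stream_text(lines):
    """Join the delta.content pieces from server-sent-event lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not an SSE data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Trimmed copies of the chunks shown in the partial output above.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"鹦鹉"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"是一"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"类"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"非常"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"有趣"}}]}',
    "data: [DONE]",
]
text = collect_stream_text(sample)
print(text)  # 鹦鹉是一类非常有趣
```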
Send request via helper python script
python examples/online_serving/openai_chat_completion_client_for_multimodal_generation.py --model Jonathan1909/Ming-flash-omni-2.0 --query-type use_mixed_modalities --port 8091 --host "localhost" --modalities text
Chat completion output from text: The audio clip features a man reciting the nursery rhyme "Mary Had a Little Lamb." The image shows a baby wearing glasses and holding a book, with the Tokyo Skytree visible through cherry blossoms in the background. The video is funny because it humorously portrays a baby mimicking an adult's behavior of reading a book, complete with exaggerated gestures and expressions, while wearing oversized glasses that add to the comedic effect.
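For readers who prefer Python over curl, the text-only requests above could be issued with the standard library alone. This is a minimal sketch, not the helper script referenced above: `build_payload` and `post_chat_completion` are hypothetical names, and the URL and port simply mirror the curl examples.

```python
import json
import urllib.request


def build_payload(model, user_text, system_text=None, stream=False):
    """Build a chat-completions body shaped like the curl examples above."""
    messages = []
    if system_text is not None:
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_text}]}
        )
    messages.append({"role": "user", "content": user_text})
    payload = {"model": model, "messages": messages, "modalities": ["text"]}
    if stream:
        payload["stream"] = True
    return payload


def post_chat_completion(payload, base_url="http://localhost:8091"):
    """POST a non-streaming request and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    "Jonathan1909/Ming-flash-omni-2.0",
    "请详细介绍鹦鹉的生活习性。",
    system_text="你是一个友好的AI助手。\n\ndetailed thinking off",
)
```

With the server from the launch command running, `post_chat_completion(payload)["choices"][0]["message"]["content"]` would return the reply text. The call is left out of the block so that it can be read and run without a live server.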
@pytest.mark.omni
@hardware_test(res={"cuda": "H100"}, num_cards=4)
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
def test_image_to_text_001(omni_server, openai_client) -> None:
@yenuo26 PTAL. Please review whether the test cases are reasonable. Is it possible to increase concurrency for image or video input?
Merges main (which now contains Ming-flash-omni-2.0 thinker (vllm-project#1822) and talker/TTS (vllm-project#2890) stages) into the image-generation feature branch. Signed-off-by: ZhengWG <zwg0606@gmail.com>
Purpose
This PR adds support for the Thinker stage of inclusionAI/Ming-flash-omni-2.0 (https://huggingface.co/inclusionAI/Ming-flash-omni-2.0), which accepts audio/image/video/text inputs and generates text.
Supported features:
vllm==0.18, and vllm==0.19)
Modified HF model repo to use: https://huggingface.co/Jonathan1909/Ming-flash-omni-2.0
Model support doc:
https://docs.google.com/document/d/16nTYLjXRFJ7ztk_phV7zZNChWawzdHacA327226QQ7o/edit?usp=sharing
Relates to #1343
Test Plan
Test Result
For detailed test results and comparisons, please refer to the subsequent comments:
#1822 (comment)
#1822 (comment)