[Model] Ming-flash-omni-2.0 Omni-Speech and TTS #2890
hsliuustc0106 merged 43 commits into vllm-project:main
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
…e text segmentation boundaries Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>
This PR is marked as [Do-Not-Merge]. Ready for full review when the Do-Not-Merge label is removed. Note: When ready for review, this PR will need:
Co-authored-by: LHXuuu <xulianhao.xlh@antgroup.com> Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Verifications in Recipe

Omni serving (Thinker + Talker)

Curl with image input:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg"}},
        {"type": "text", "text": "Describe this image in detail."}
      ]}
    ],
    "modalities": ["text"]
  }'

Output:

Curl with modalities audio and save wav file:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
    ],
    "modalities": ["audio"]
  }' | jq -r '.choices[0].message.audio.data' | base64 -d > ming_omni_parrot.wav

Output:

Curl with audio input (ASR) and output both text and audio:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"}},
        {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."}
      ]}
    ],
    "modalities": ["text", "audio"]
  }' | jq -r '.choices[0].message.content'

Output:

Curl with streaming mode:

curl -N http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "你是一个友好的AI助手。\n\ndetailed thinking off"}]},
      {"role": "user", "content": "请详细介绍鹦鹉的生活习性。"}
    ],
    "modalities": ["text"],
    "stream": true
  }'

Output (partial):
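For clients that would rather post-process these responses in Python than with `jq` and `base64`, the following is a minimal sketch and not part of this PR. `extract_audio_bytes` reads the same `choices[0].message.audio.data` path used in the audio curl command above. `iter_stream_text` is more speculative: it assumes the streaming endpoint emits OpenAI-style server-sent events (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`), which the commands above do not confirm. Both function names are hypothetical.

```python
import base64
import json
from typing import Iterable, Iterator


def extract_audio_bytes(response: dict) -> bytes:
    """Decode the base64 audio payload of a non-streaming chat completion.

    Uses the same path as the jq filter in the curl example:
    .choices[0].message.audio.data
    """
    encoded = response["choices"][0]["message"]["audio"]["data"]
    return base64.b64decode(encoded)


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from streamed response lines.

    ASSUMPTION: the server emits OpenAI-style SSE chunks, i.e. lines of the
    form 'data: {json}' with text under choices[].delta.content, followed by
    a final 'data: [DONE]' sentinel.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

To mirror the curl pipeline, one would load the response JSON with `json.load`, pass it to `extract_audio_bytes`, and write the returned bytes to `ming_omni_parrot.wav` in binary mode.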
hsliuustc0106 left a comment:
Can we switch to the CLI overrides and remove the YAML files after #2383?
Verifications in Recipe

TTS (standalone talker)

Curl cmd:

curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "input": "我会一直在这里陪着你。",
    "response_format": "wav"
  }' --output ming_online.wav

Output:

Curl with speaker:

curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jonathan1909/Ming-flash-omni-2.0",
    "input": "春天来了,万物复苏,大地一片生机盎然。田野里的油菜花开得金灿灿的,蜜蜂在花丛中忙碌地采蜜。远处的山坡上,桃花和杏花竞相绽放,粉的白的交织在一起,美不胜收。清晨的微风带着泥土的芬芳,轻轻拂过脸颊,让人感到无比惬意。孩子们在田间小路上追逐嬉戏,老人们坐在门前晒太阳,享受着这份宁静与美好。",
    "speaker": "lingguang",
    "response_format": "wav"
  }' --output ming_online_lingguang.wav

Output:
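The two TTS requests above differ only in the optional `speaker` field. As a hedged sketch (not code from this PR), the same requests could be built in Python with the standard library. The URL, model name, and JSON fields are copied from the curl commands; `build_speech_request` and `synthesize_to_file` are hypothetical helper names, and the sketch assumes the endpoint returns the raw audio bytes as the response body, as the `--output` flag in the curl commands suggests.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8091"          # host and port from the curl commands
MODEL = "Jonathan1909/Ming-flash-omni-2.0"  # model name from the curl commands


def build_speech_request(
    text: str,
    speaker: Optional[str] = None,
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    body = {"model": MODEL, "input": text, "response_format": response_format}
    if speaker is not None:
        body["speaker"] = speaker  # e.g. "lingguang", as in the second curl command
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str, speaker: Optional[str] = None) -> None:
    """Send the request and save the response body to `path`.

    ASSUMPTION: the response body is the audio file itself.
    Requires a running server at BASE_URL.
    """
    request = build_speech_request(text, speaker)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

With a server running, `synthesize_to_file("我会一直在这里陪着你。", "ming_online.wav")` would correspond to the first curl command, and passing `speaker="lingguang"` to the second.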
@hsliuustc0106 As the model features are ready to use, shall we first merge talker support (Omni-Speech, TTS) in this PR and image generation (#2875)? Converting to the new config system requires adapting and re-running all test cases on both PRs, which might delay these features. I could update the config system for all Ming-related features in a subsequent PR.
hsliuustc0106 left a comment:
Do we update the docs?
I updated the docs in these commits:
https://github.com/vllm-project/vllm-omni/pull/2890/changes/faea882515db83b08fed34b2b63cbb3dd79dc636..0b1c1059106f67baeaee6790e1f891ac82fce3bf
Basically, I trimmed the examples and moved the most representative cases to the corresponding recipes. Any suggestions for trimming the example directories further?
LGTM
Purpose
Omni-Speech (thinker + talker stages) and TTS support for #1343
cc @LHXuuu, @ZhengWG
Test Plan
TTS (standalone talker)
Omni-Speech (thinker + talker)
Please see my subsequent comments
Test Result
#2890 (comment)
#2890 (comment)