feat: add qwen2.5-omni #143
Conversation
Signed-off-by: weedge <[email protected]>
Hello @weedge, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces a new feature to run Qwen2.5-Omni, a multimodal model, on Modal. It includes the necessary code to set up the environment, load the model, and run inference on various tasks such as audio understanding, voice chatting, and video information extraction. The pull request also adds a script to automate the process of downloading models and assets, as well as running the different test cases.
Highlights
- Qwen2.5-Omni Integration: Adds support for running Qwen2.5-Omni, a multimodal model, on the Modal platform.
- Task Demos: Includes demos for various tasks such as universal audio understanding, voice chatting, video information extraction, screen recording interaction, and omni-chatting for math and music.
- Automated Script: Introduces a shell script to automate downloading models/assets and running test cases with different configurations (GPU, task, etc.).
Changelog
Click here to see the changelog
- .gitignore
- Added *.mp4 to the ignored files list on line 165.
- Ensured *.xml is included in the git repository on line 210.
- deploy/modal/src/llm/transformers/qwen2_5omni.py
- Introduces a new Modal app for Qwen2.5-Omni (a rough structural sketch follows this changelog).
- Sets up the environment with necessary dependencies (transformers, torch, flash-attn, etc.).
- Defines functions for various multimodal tasks, including audio understanding, voice chatting, and video information extraction.
- Implements an inference function to process and generate responses based on different input types (audio, images, videos).
- Adds a main function to run the different tasks based on user input.
- deploy/modal/src/llm/transformers/run_omni_cases.sh
- Introduces a shell script to automate the process of downloading models and assets.
- Provides command-line arguments to configure the GPU, task, model type, and transformers commit.
- Includes functions to run different test cases with specified configurations.
- Adds a usage function to display help information.
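For orientation, here is a minimal sketch of how a Modal app along these lines could be laid out. Everything below (function names, dependency list, GPU choice, timeout) is illustrative only and is not the actual contents of deploy/modal/src/llm/transformers/qwen2_5omni.py:

```python
# Illustrative Modal app skeleton only; names, dependencies, and defaults here
# are placeholders, not the code added in this PR.
import modal

app = modal.App("qwen2_5omni-sketch")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "accelerate", "soundfile")
)


@app.function(image=image, gpu="L4", timeout=1200)
def run_task(task: str = "universal_audio_understanding") -> str:
    """Dispatch one of the multimodal demo tasks (placeholder body)."""
    # In a real module this would load Qwen2.5-Omni, build the multimodal
    # conversation for the chosen task, and run generation.
    return f"ran task: {task}"


@app.local_entrypoint()
def main(task: str = "universal_audio_understanding"):
    print(run_task.remote(task))
```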
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A model of sight and sound,
Qwen's Omni, profound.
Modal's cloud takes flight,
Processing day and night,
New AI wonders abound.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces the Qwen2.5-Omni model to the Modal deployment, including the necessary dependencies, configuration, and example tasks. The code appears well-structured and includes several example use cases. However, there are a few areas that could be improved for clarity and maintainability.
Summary of Findings
- Missing Error Handling: The subprocess.run calls in qwen2_5omni.py lack error handling. If these commands fail, the script will continue, potentially leading to incorrect results or unexpected behavior. Consider adding error checking to ensure the commands execute successfully.
- Hardcoded Paths: The paths HF_MODEL_DIR and ASSETS_DIR are hardcoded in qwen2_5omni.py. It would be better to make these configurable via environment variables to allow for more flexible deployment.
- Inconsistent Use of use_audio_in_video: The use_audio_in_video parameter is used inconsistently across different function calls in qwen2_5omni.py. Ensure that this parameter is used correctly and consistently to avoid unexpected behavior.
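A minimal sketch of one way to address the first two findings, assuming nothing about the actual layout of qwen2_5omni.py; the helper name, default paths, and example command below are placeholders, not code from this PR:

```python
# Illustrative only: checked subprocess calls plus env-configurable paths.
# The defaults and helper name are hypothetical stand-ins.
import os
import subprocess

# Paths can be overridden via environment variables; the fallbacks are placeholders.
HF_MODEL_DIR = os.getenv("HF_MODEL_DIR", "/models")
ASSETS_DIR = os.getenv("ASSETS_DIR", "/assets")


def run_checked(cmd: list[str]) -> None:
    """Run a command and fail loudly instead of silently continuing."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"command failed with exit code {e.returncode}: {' '.join(cmd)}"
        ) from e


if __name__ == "__main__":
    # Example usage with a harmless command.
    run_checked(["ls", HF_MODEL_DIR])
```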
Merge Readiness
The pull request introduces a significant new feature and includes example tasks, which is commendable. However, the missing error handling and hardcoded paths should be addressed before merging. I am unable to directly approve this pull request, and recommend that other reviewers also examine this code before merging. At a minimum, the high severity issues should be addressed before merging.
Qwen2.5Omni: 10732.225408 M parameters
Force-pushed from 88d844e to ac829c8.
Commits (all signed off by weedge <[email protected]>), with messages truncated in the PR timeline:

- …alGenerationNew class inherit from Qwen2_5OmniForConditionalGeneration
- … to generate fast
- …ment_stream
- …den_states
- …den_states
- …rmersManualQwen2_5OmniLLM
- thinekr_genrate_chunk hidden_states_len for modality embedding stream cases:
  - screen_recording_interaction_stream
  - screen_recording_interaction_chunk_stream
  - video_information_extracting_stream
  - video_information_extracting_chunk_stream
  - omni_chatting_for_math_stream
  - omni_chatting_for_music_stream
  - omni_chatting_for_math_chunk_stream
  - omni_chatting_for_music_chunk_stream
- …vekitQwen2_5OmniVisionVoiceBot and config deploy on modal
- …indow code2wav with achatbot lib
- …er tasks to run
feat:
- add qwen2_code2wav streaming from vllm, change it for achatbot (cfm dit + bigvgan) (maybe add zmq as connector)
- add qwen2_5omni_asr and unit test
```shell
LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
  THINKER_LLM_GEN_TEMPERATURE=0.9 \
  LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
  python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe_stream

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
  THINKER_LLM_GEN_TEMPERATURE=0.9 \
  LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
  python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
  THINKER_LLM_GEN_TEMPERATURE=0.9 \
  LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
  python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe_with_bytes
```
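For orientation, a minimal sketch of what such an environment-driven unittest might look like. The ASR class, its constructor arguments, and its transcribe method below are hypothetical stand-ins, not the actual achatbot API; only the env-var configuration pattern mirrors the commands above:

```python
# Hypothetical sketch of an env-configured unittest in the spirit of
# test_qwen2_5omni_asr. The ASR class and its methods are placeholders.
import os
import unittest


class FakeQwen2_5OmniASR:
    """Placeholder standing in for the real ASR module under test."""

    def __init__(self, model_path: str, device: str, torch_dtype: str, temperature: float):
        self.model_path = model_path
        self.device = device
        self.torch_dtype = torch_dtype
        self.temperature = temperature

    def transcribe(self, audio_path: str) -> str:
        # A real implementation would run the Qwen2.5-Omni thinker over audio input.
        return f"transcript of {audio_path}"


class TestQwen2_5OmniASRSketch(unittest.TestCase):
    def setUp(self):
        # Configuration mirrors the environment variables used on the command line above.
        self.asr = FakeQwen2_5OmniASR(
            model_path=os.getenv("LLM_MODEL_NAME_OR_PATH", "./models/Qwen/Qwen2.5-Omni-7B"),
            device=os.getenv("LLM_DEVICE", "cuda"),
            torch_dtype=os.getenv("LLM_TORCH_DTYPE", "bfloat16"),
            temperature=float(os.getenv("THINKER_LLM_GEN_TEMPERATURE", "0.9")),
        )

    def test_transcribe(self):
        text = self.asr.transcribe("./assets/example.wav")
        self.assertTrue(isinstance(text, str) and len(text) > 0)


if __name__ == "__main__":
    unittest.main()
```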
deploy modal fastapi-webrtc serve to run personal bots one by one

run webrtc_qwen2_5omni_vision_voice_bot serve with webrtc:

```shell
# webrtc_vision_bot serve on qwen2.5omni vision llm
IMAGE_NAME=qwen2.5omni IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L4 \
  modal serve -e achatbot src/fastapi_webrtc_vision_bot_serve.py
```

curl api to run chat room bot with webrtc (daily/livekit/agora); use livekit_room
TMRoPE (Time-aligned Multimodal RoPE): aligns audio and video tokens on a shared absolute-time axis, so tokens from different modalities that occur at the same moment receive matching temporal position ids.
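As a rough illustration of the idea (not the model's actual implementation), temporal position ids can be derived from absolute timestamps so that audio and video tokens occurring at the same moment share an id. The helper below is hypothetical, and the 0.04 s step is an assumption based on the model's roughly 25 Hz audio token rate:

```python
# Rough illustration of time-aligned temporal position ids (hypothetical helper).
# Tokens from different modalities that occur at the same wall-clock time get
# the same temporal id, which is the core idea behind TMRoPE's time alignment.
def temporal_position_ids(timestamps_s: list[float], step_s: float = 0.04) -> list[int]:
    return [int(t / step_s) for t in timestamps_s]

audio_ids = temporal_position_ids([0.00, 0.04, 0.08])  # [0, 1, 2]
video_ids = temporal_position_ids([0.00, 0.04, 0.08])  # same ids -> aligned with audio
```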
talker -> code -> cfm dit -> mel -> bigvgan -> waveforms streaming (for the generation source code, see vllm)
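A minimal sketch of this streaming shape, assuming hypothetical cfm_dit_to_mel and bigvgan_vocoder callables; the real code lives in vllm and the achatbot code2wav module, so only the chunk-by-chunk structure is the point here:

```python
# Hypothetical sketch of the talker -> codes -> CFM DiT -> mel -> BigVGAN flow
# as a streaming generator. All callables below are placeholders for the real
# code2wav components.
from typing import Callable, Iterable, Iterator

import torch


def stream_code2wav(
    code_chunks: Iterable[torch.Tensor],                       # talker codec codes, chunk by chunk
    cfm_dit_to_mel: Callable[[torch.Tensor], torch.Tensor],    # codes -> mel spectrogram
    bigvgan_vocoder: Callable[[torch.Tensor], torch.Tensor],   # mel -> waveform
) -> Iterator[torch.Tensor]:
    """Yield waveform chunks as soon as each code chunk is vocoded."""
    for codes in code_chunks:
        mel = cfm_dit_to_mel(codes)   # CFM DiT stage
        wav = bigvgan_vocoder(mel)    # BigVGAN stage
        yield wav                     # stream out without waiting for the full utterance


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    fake_codes = (torch.randint(0, 1024, (1, 50)) for _ in range(3))
    to_mel = lambda c: torch.randn(1, 80, c.shape[-1] * 2)
    to_wav = lambda m: torch.randn(1, m.shape[-1] * 240)
    for chunk in stream_code2wav(fake_codes, to_mel, to_wav):
        print(chunk.shape)
```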
reference:
- code2wav inference
- use vllm inference