[examples][model] fix: add Qwen2.5-Omni examples and fix vision encoder inference#2965
Conversation
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…i thinker model
Qwen2_5OmniVisionEncoder.forward() returns BaseModelOutputWithPooling, not a plain tensor. Extract .pooler_output (the merger-projected features) before assigning to combined_embeddings to fix the inference crash: `TypeError: can't assign a BaseModelOutputWithPooling to a BFloat16Tensor`. Also update the README to clarify the qwen-omni-utils install command and note that --use_audio_in_video requires ffmpeg on the system.
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…sage notes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
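For context, a minimal sketch of the fix described in the commits above; the surrounding names (`vision_encoder`, `pixel_values`, `image_grid_thw`, `combined_embeddings`, `image_mask`) are assumptions based on the commit message, not the exact thinker_model.py code:

```python
# Sketch only: variable names are assumed, not taken from thinker_model.py.
vision_outputs = self.vision_encoder(pixel_values, grid_thw=image_grid_thw)

# Qwen2_5OmniVisionEncoder.forward() returns a BaseModelOutputWithPooling,
# not a plain tensor. Assigning it directly into the embedding tensor raised:
#   TypeError: can't assign a BaseModelOutputWithPooling to a BFloat16Tensor
image_embeds = vision_outputs.pooler_output  # merger-projected features

# Inject the pooled features into the combined text/vision embeddings.
combined_embeddings[image_mask] = image_embeds
```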
/ok to test aba5e9c
📝 Walkthrough
The PR adds documentation and tooling for the Qwen2.5-Omni vision-language model example, including a README with setup instructions, checkpoint conversion workflow orchestration, and multi-GPU inference scripts, and fixes the thinker model to extract pooled visual embeddings instead of raw encoder outputs.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `examples/models/vlm/qwen25_omni/inference.sh`:
- Around lines 25-53: The script passes --video_url together with --use_audio_in_video, which breaks the audio extraction flow. Update the three invocations that call examples/conversion/hf_to_megatron_generate_omni_lm.py to use --video_path (a local file) instead of --video_url when --use_audio_in_video is present, or remove --use_audio_in_video if a remote URL is intended. Specifically, change the calls that combine --video_url with --use_audio_in_video so they provide --video_path "${VIDEO_PATH}" (and ensure the VIDEO_PATH environment variable is set) in the HF, Megatron, and exported-HF blocks; see the sketch below.
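A hedged illustration of that change for one invocation; only the script path and the flags named in the comment come from the review, the URL and file path are placeholders:

```bash
# Before: remote URL combined with audio extraction (flagged by the review)
python examples/conversion/hf_to_megatron_generate_omni_lm.py \
  --video_url "https://example.com/sample.mp4" \
  --use_audio_in_video

# After: local file via --video_path; export VIDEO_PATH beforehand
export VIDEO_PATH=/path/to/sample.mp4
python examples/conversion/hf_to_megatron_generate_omni_lm.py \
  --video_path "${VIDEO_PATH}" \
  --use_audio_in_video
```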
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f53bab57-c997-4f8f-9914-c5462c6239ab
📒 Files selected for processing (4)
- examples/models/vlm/qwen25_omni/README.md
- examples/models/vlm/qwen25_omni/conversion.sh
- examples/models/vlm/qwen25_omni/inference.sh
- src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
/ok to test a08769a
Summary
Follow-up to #2634 (Qwen2.5-Omni model support).
- Add examples for Qwen/Qwen2.5-Omni-7B (examples/models/vlm/qwen25_omni/)
- Fix thinker_model.py: Qwen2_5OmniVisionEncoder.forward() returns BaseModelOutputWithPooling; extract .pooler_output (merger-projected features) before injecting into combined embeddings
- Update the README to clarify the qwen-omni-utils install command (imageio-ffmpeg) and the --video_path requirement for --use_audio_in_video; see the sketch after this list
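For reference, a hedged sketch of the README guidance the last bullet refers to; the exact commands and package extras in the README may differ:

```bash
# Utilities for preparing Qwen2.5-Omni multimodal inputs
pip install qwen-omni-utils

# --use_audio_in_video extracts the audio track from a local video file,
# which requires ffmpeg on the system, e.g.:
apt-get install ffmpeg
# or, as the summary notes, via the imageio-ffmpeg Python package:
pip install imageio-ffmpeg
```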
Test plan
- EXIT=0 on cluster
- --use_audio_in_video: coherent audio-grounded output

🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Documentation
Improvements