
[examples][model] fix: add Qwen2.5-Omni examples and fix vision encoder inference#2965

Merged
yaoyu-33 merged 4 commits into main from yuya/qwen25-omni-examples-fix
Mar 24, 2026
Conversation

Contributor

@yaoyu-33 yaoyu-33 commented Mar 23, 2026

Summary

Follow-up to #2634 (Qwen2.5-Omni model support).

  • Add conversion and inference example scripts + README for Qwen/Qwen2.5-Omni-7B (examples/models/vlm/qwen25_omni/)
  • Fix inference crash in thinker_model.py: Qwen2_5OmniVisionEncoder.forward() returns BaseModelOutputWithPooling — extract .pooler_output (merger-projected features) before injecting into combined embeddings
  • Document ffmpeg setup (imageio-ffmpeg) and the --video_path requirement for --use_audio_in_video
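The vision encoder fix above can be sketched as follows. This is a minimal illustration of the shape of the change, not the actual thinker_model.py code: a plain dataclass stands in for transformers' BaseModelOutputWithPooling (which in reality holds tensors), and the helper name is hypothetical.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class BaseModelOutputWithPooling:
    """Stand-in for the transformers output class returned by
    Qwen2_5OmniVisionEncoder.forward(); it is not a plain tensor."""
    last_hidden_state: Any
    pooler_output: Any  # merger-projected visual features

def extract_visual_features(encoder_output):
    # Before the fix, the whole output object was injected into
    # combined_embeddings, which raised a TypeError at inference time.
    # The fix pulls out the pooled (merger-projected) features first.
    return encoder_output.pooler_output

out = BaseModelOutputWithPooling(
    last_hidden_state=[[0.1, 0.2, 0.3]],
    pooler_output=[[0.4, 0.5]],
)
visual_features = extract_visual_features(out)
```

Only visual_features, not the wrapper object, is then assigned into the combined embedding tensor.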

Test plan

  • Conversion HF → Megatron: EXIT=0 on cluster
  • Inference with HF checkpoint (video-only): coherent output
  • Inference with imported Megatron checkpoint (video-only): matches HF output
  • Inference with local video + --use_audio_in_video: coherent audio-grounded output

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added checkpoint conversion tools supporting Hugging Face and Megatron format interchange with multi-GPU validation
    • Added distributed inference script for Qwen2.5-Omni model with audio-video support
  • Documentation

    • Added comprehensive setup and usage documentation for Qwen2.5-Omni, including prerequisites and configuration guidance
  • Improvements

    • Enhanced vision/audio embedding extraction in the Qwen2.5-Omni model

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…i thinker model

Qwen2_5OmniVisionEncoder.forward() returns BaseModelOutputWithPooling,
not a plain tensor. Extract .pooler_output (the merger-projected features)
before assigning to combined_embeddings to fix inference crash:
  TypeError: can't assign a BaseModelOutputWithPooling to a BFloat16Tensor

Also update README to clarify qwen-omni-utils install command and note
that --use_audio_in_video requires ffmpeg on the system.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…sage notes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Contributor Author

/ok to test aba5e9c

@coderabbitai
Contributor

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough


The PR adds documentation and tooling for the Qwen2.5-Omni vision-language model example, including a README with setup instructions, checkpoint conversion workflow orchestration, multi-GPU inference scripts, and fixes the thinker model to extract pooled visual embeddings instead of raw encoder outputs.

Changes

  • Documentation (examples/models/vlm/qwen25_omni/README.md): Introduces the Qwen2.5-Omni example directory with model metadata, prerequisites for audio/video support, workspace configuration, and documentation of checkpoint conversion (HF→Megatron import/export) and inference commands.
  • Checkpoint Conversion Tooling (examples/models/vlm/qwen25_omni/conversion.sh): Bash script orchestrating a three-stage checkpoint conversion workflow: HF-to-Megatron import, Megatron-to-HF export, and multi-GPU round-trip validation using tensor/pipe parallel configuration.
  • Inference Orchestration (examples/models/vlm/qwen25_omni/inference.sh): Bash script executing distributed multi-GPU inference across three checkpoint sources: native HF, imported Megatron, and exported HF checkpoints, with audio-video feature extraction and token generation.
  • Model Implementation (src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py): Updated vision/audio embedding extraction to use pooler_output from the visual encoder instead of raw encoder outputs for downstream token replacement.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title accurately describes the two main changes: adding Qwen2.5-Omni examples and fixing vision encoder inference with pooler_output extraction.
  • Test Results For Major Changes ✅ Passed: the PR describes comprehensive test results, including conversion validation (EXIT=0), inference testing with HF and Megatron checkpoints, output consistency verification, and audio-visual feature testing.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `examples/models/vlm/qwen25_omni/inference.sh`:
- Around lines 25-53: The script passes --video_url together with --use_audio_in_video, which violates the audio extraction flow. Update the three invocations of examples/conversion/hf_to_megatron_generate_omni_lm.py (the HF, Megatron, and exported-HF blocks) to pass --video_path "${VIDEO_PATH}" (a local file) instead of --video_url when --use_audio_in_video is present, and ensure the VIDEO_PATH environment variable is set; alternatively, remove --use_audio_in_video if a remote URL is intended.
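The argument rule the reviewer describes can be sketched with a small helper. This function is hypothetical (it does not exist in the PR); it only illustrates the constraint that --use_audio_in_video requires a local --video_path so ffmpeg can demux the audio track.

```python
def build_video_args(use_audio_in_video, video_path=None, video_url=None):
    """Build the video-related CLI arguments for the inference script.

    Illustrative only: encodes the reviewer's rule that audio-in-video
    extraction needs a local file, not a remote URL.
    """
    if use_audio_in_video:
        if not video_path:
            raise ValueError("--use_audio_in_video requires a local --video_path")
        return ["--video_path", video_path, "--use_audio_in_video"]
    # Without audio extraction, either a remote URL or a local path works.
    if video_url is not None:
        return ["--video_url", video_url]
    return ["--video_path", video_path]

args = build_video_args(True, video_path="/workspace/sample.mp4")
# args: ["--video_path", "/workspace/sample.mp4", "--use_audio_in_video"]
```

The same check could equally be written as a guard at the top of inference.sh; the point is that the flag combination is validated before the three per-checkpoint invocations run.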

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f53bab57-c997-4f8f-9914-c5462c6239ab

📥 Commits

Reviewing files that changed from the base of the PR and between 4b91a77 and aba5e9c.

📒 Files selected for processing (4)
  • examples/models/vlm/qwen25_omni/README.md
  • examples/models/vlm/qwen25_omni/conversion.sh
  • examples/models/vlm/qwen25_omni/inference.sh
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test a08769a

@yaoyu-33 yaoyu-33 added the docs-only label Mar 24, 2026
@yaoyu-33 yaoyu-33 merged commit 8d2b0ea into main Mar 24, 2026
36 of 39 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/qwen25-omni-examples-fix branch March 24, 2026 02:15

Labels

docs-only With great power comes great responsibility.


2 participants