
[examples][model] fix: add Qwen2.5-Omni examples and fix vision encoder inference#2965

Merged
yaoyu-33 merged 4 commits into main from yuya/qwen25-omni-examples-fix
Mar 24, 2026
Conversation

Contributor

@yaoyu-33 yaoyu-33 commented Mar 23, 2026

Summary

Follow-up to #2634 (Qwen2.5-Omni model support).

  • Add conversion and inference example scripts + README for Qwen/Qwen2.5-Omni-7B (examples/models/vlm/qwen25_omni/)
  • Fix inference crash in thinker_model.py: Qwen2_5OmniVisionEncoder.forward() returns BaseModelOutputWithPooling — extract .pooler_output (merger-projected features) before injecting into combined embeddings
  • Document ffmpeg setup (imageio-ffmpeg) and the --video_path requirement for --use_audio_in_video
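The vision encoder fix above can be sketched as follows. This is a minimal illustration of the shape of the change, not the actual thinker_model.py code: a plain dataclass stands in for transformers' BaseModelOutputWithPooling (which in reality holds tensors), and the helper name is hypothetical.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class BaseModelOutputWithPooling:
    """Stand-in for the transformers output class returned by
    Qwen2_5OmniVisionEncoder.forward(); it is not a plain tensor."""
    last_hidden_state: Any
    pooler_output: Any  # merger-projected visual features

def extract_visual_features(encoder_output):
    # Before the fix, the whole output object was injected into
    # combined_embeddings, which raised a TypeError at inference time.
    # The fix pulls out the pooled (merger-projected) features first.
    return encoder_output.pooler_output

out = BaseModelOutputWithPooling(
    last_hidden_state=[[0.1, 0.2, 0.3]],
    pooler_output=[[0.4, 0.5]],
)
visual_features = extract_visual_features(out)
```

Only visual_features, not the wrapper object, is then assigned into the combined embedding tensor.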

Test plan

  • Conversion HF → Megatron: EXIT=0 on cluster
  • Inference with HF checkpoint (video-only): coherent output
  • Inference with imported Megatron checkpoint (video-only): matches HF output
  • Inference with local video + --use_audio_in_video: coherent audio-grounded output

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added checkpoint conversion tools supporting Hugging Face and Megatron format interchange with multi-GPU validation
    • Added distributed inference script for Qwen2.5-Omni model with audio-video support
  • Documentation

    • Added comprehensive setup and usage documentation for Qwen2.5-Omni, including prerequisites and configuration guidance
  • Improvements

    • Enhanced vision/audio embedding extraction in the Qwen2.5-Omni model

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…i thinker model

Qwen2_5OmniVisionEncoder.forward() returns BaseModelOutputWithPooling,
not a plain tensor. Extract .pooler_output (the merger-projected features)
before assigning to combined_embeddings to fix inference crash:
  TypeError: can't assign a BaseModelOutputWithPooling to a BFloat16Tensor

Also update README to clarify qwen-omni-utils install command and note
that --use_audio_in_video requires ffmpeg on the system.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…sage notes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Contributor Author

/ok to test aba5e9c

@coderabbitai
Contributor

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough


The PR adds documentation and tooling for the Qwen2.5-Omni vision-language model example, including a README with setup instructions, checkpoint conversion workflow orchestration, multi-GPU inference scripts, and fixes the thinker model to extract pooled visual embeddings instead of raw encoder outputs.

Changes

  • Documentation (examples/models/vlm/qwen25_omni/README.md): Introduces the Qwen2.5-Omni example directory with model metadata, prerequisites for audio/video support, workspace configuration, and documentation of checkpoint conversion (HF→Megatron import/export) and inference commands.
  • Checkpoint Conversion Tooling (examples/models/vlm/qwen25_omni/conversion.sh): Bash script orchestrating a three-stage checkpoint conversion workflow: HF-to-Megatron import, Megatron-to-HF export, and multi-GPU round-trip validation using tensor/pipe parallel configuration.
  • Inference Orchestration (examples/models/vlm/qwen25_omni/inference.sh): Bash script executing distributed multi-GPU inference across three checkpoint sources: native HF, imported Megatron, and exported HF checkpoints, with audio-video feature extraction and token generation.
  • Model Implementation (src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py): Updated vision/audio embedding extraction to use pooler_output from the visual encoder instead of raw encoder outputs for downstream token replacement.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title accurately describes the two main changes: adding Qwen2.5-Omni examples and fixing vision encoder inference with pooler_output extraction.
  • Test Results For Major Changes ✅ Passed: the PR describes comprehensive test results, including conversion validation (EXIT=0), inference testing with HF and Megatron checkpoints, output consistency verification, and audio-visual feature testing.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `examples/models/vlm/qwen25_omni/inference.sh`:
- Around lines 25-53: The script passes --video_url together with --use_audio_in_video, which violates the audio extraction flow. Update the three invocations of examples/conversion/hf_to_megatron_generate_omni_lm.py (the HF, Megatron, and exported-HF blocks) to pass --video_path "${VIDEO_PATH}" (a local file) instead of --video_url when --use_audio_in_video is present, and ensure the VIDEO_PATH environment variable is set; alternatively, remove --use_audio_in_video if a remote URL is intended.
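The argument rule the reviewer describes can be sketched with a small helper. This function is hypothetical (it does not exist in the PR); it only illustrates the constraint that --use_audio_in_video requires a local --video_path so ffmpeg can demux the audio track.

```python
def build_video_args(use_audio_in_video, video_path=None, video_url=None):
    """Build the video-related CLI arguments for the inference script.

    Illustrative only: encodes the reviewer's rule that audio-in-video
    extraction needs a local file, not a remote URL.
    """
    if use_audio_in_video:
        if not video_path:
            raise ValueError("--use_audio_in_video requires a local --video_path")
        return ["--video_path", video_path, "--use_audio_in_video"]
    # Without audio extraction, either a remote URL or a local path works.
    if video_url is not None:
        return ["--video_url", video_url]
    return ["--video_path", video_path]

args = build_video_args(True, video_path="/workspace/sample.mp4")
# args: ["--video_path", "/workspace/sample.mp4", "--use_audio_in_video"]
```

The same check could equally be written as a guard at the top of inference.sh; the point is that the flag combination is validated before the three per-checkpoint invocations run.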

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f53bab57-c997-4f8f-9914-c5462c6239ab

📥 Commits

Reviewing files that changed from the base of the PR and between 4b91a77 and aba5e9c.

📒 Files selected for processing (4)
  • examples/models/vlm/qwen25_omni/README.md
  • examples/models/vlm/qwen25_omni/conversion.sh
  • examples/models/vlm/qwen25_omni/inference.sh
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test a08769a

@yaoyu-33 yaoyu-33 added the docs-only label Mar 24, 2026
@yaoyu-33 yaoyu-33 merged commit 8d2b0ea into main Mar 24, 2026
36 of 39 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/qwen25-omni-examples-fix branch March 24, 2026 02:15

Labels

docs-only With great power comes great responsibility.


2 participants