[model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts#2735

Merged
kamran-nvidia merged 13 commits into main from kamran/variuos_fixes
Mar 12, 2026
Conversation

@kamran-nvidia
Contributor

@kamran-nvidia kamran-nvidia commented Mar 10, 2026

What does this PR do ?

[model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts and recipes

Changelog

Summary

  • Refactor Qwen3-VL recipes: Remove dataset_type parameter from all SFT/PEFT config functions, replacing the inline Energon branching with a dedicated qwen3_vl_8b_peft_energon_config recipe. This simplifies each config function and makes dataset selection explicit via recipe choice rather than a runtime argument.
  • Remove --dataset_type CLI argument from run_recipe.py and all recipe loading functions, reducing interface complexity.
  • Rename Ministral3 scripts (peft.sh → peft_unpacked.sh, sft.sh → sft_unpacked.sh) and rename sft_energon.sh → peft_energon.sh for Qwen3-VL to better reflect their actual usage.
  • Fix bugs: Correct the train.eval_iters → validation.eval_iters parameter reference in sft_unpacked.sh.
  • Update parallelism configs: Adjust TP/PP/EP settings in peft_unpacked.sh and sft_unpacked.sh for more practical finetuning configurations.
  • Add W&B report links for expected training dynamics in Qwen3-VL and Ministral3 READMEs.
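The first two bullets describe replacing a runtime `dataset_type` flag with explicit per-dataset recipe functions. A minimal sketch of that pattern, with hypothetical stand-ins — `RecipeConfig` and these function bodies are illustrative, not the actual signatures in `src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py`:

```python
from dataclasses import dataclass, field


@dataclass
class RecipeConfig:
    """Illustrative stand-in for a training recipe configuration."""
    model: str = "qwen3-vl-8b"
    dataset: str = "hf_chat"  # default (non-Energon) dataset, name assumed
    peft: bool = False
    overrides: dict = field(default_factory=dict)


# Before: one config function with inline branching on a runtime argument.
def qwen3_vl_8b_peft_config_old(dataset_type: str = "hf_chat") -> RecipeConfig:
    cfg = RecipeConfig(peft=True)
    if dataset_type == "energon":  # the inline branching this PR removes
        cfg.dataset = "energon"
    return cfg


# After: dataset selection is explicit via recipe choice.
def qwen3_vl_8b_peft_config() -> RecipeConfig:
    return RecipeConfig(peft=True)


def qwen3_vl_8b_peft_energon_config() -> RecipeConfig:
    # Dedicated Energon recipe instead of a dataset_type flag.
    cfg = qwen3_vl_8b_peft_config()
    cfg.dataset = "energon"
    return cfg
```

The upside of the second shape is that the set of supported dataset/recipe combinations is visible in the module's public API rather than hidden behind a runtime argument.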

Test plan

  • Verify Qwen3-VL 8B SFT with sft_unpacked.sh runs correctly with updated parallelism (TP=4,1 and TP=2,1)
  • Verify Qwen3-VL 8B PEFT with peft_unpacked.sh runs with updated parallelism configs
  • Verify peft_energon.sh works with the new qwen3_vl_8b_peft_energon_config recipe
  • Confirm run_recipe.py no longer accepts --dataset_type and existing scripts still work
  • Verify W&B report links in READMEs are accessible
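The eval_iters fix above illustrates a general pitfall with dotted-path config overrides: the key must target the section that actually owns it (validation.eval_iters, not train.eval_iters). A sketch assuming a plain nested-dict config — `apply_override` is a hypothetical helper, not the project's actual override mechanism:

```python
def apply_override(config: dict, dotted_key: str, value):
    """Set config[a][b]... = value for a dotted key like 'validation.eval_iters'."""
    *sections, leaf = dotted_key.split(".")
    node = config
    for section in sections:
        if section not in node:
            raise KeyError(f"unknown config section: {section!r}")
        node = node[section]
    if leaf not in node:
        # Strict lookup surfaces the bug instead of silently creating a
        # dead key under the wrong section.
        raise KeyError(f"{dotted_key!r} does not exist in the config")
    node[leaf] = value


config = {
    "train": {"train_iters": 100, "global_batch_size": 16},
    "validation": {"eval_iters": 10, "eval_interval": 20},
}

apply_override(config, "validation.eval_iters", 20)  # the corrected key
```

With a strict lookup like this, the old `train.eval_iters` reference would raise immediately rather than being ignored, which is the kind of silent misconfiguration the sft_unpacked.sh fix addresses.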

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added Energon dataset support for Qwen3-VL PEFT training workflows.
  • Documentation

    • Updated training guides with new "Expected Training Dynamics" sections including performance benchmarks via WandB reports.
  • Chores

    • Adjusted training hyperparameters (iterations, batch sizes, learning rates) across VLM example scripts for optimized performance.
    • Updated model configurations to use Instruct model variants.
    • Simplified training script parameter handling by removing dataset type configuration option.

…e-tuning

- Updated README.md to reference new unpacked scripts for SFT and PEFT.
- Introduced new `sft_unpacked.sh` and `peft_unpacked.sh` scripts for Ministral3 with enhanced configurations.
- Modified `peft.sh` and `sft.sh` scripts for Qwen3-VL to improve training parameters and evaluation intervals.
- Added `peft_energon.sh` for Energon dataset fine-tuning with LoRA.
- Enhanced `qwen3_vl.py` to support Energon dataset and updated dataset handling in SFT and PEFT configurations.
- Adjusted parallelism configurations and training parameters across various scripts for consistency and performance.

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 added the area:model (Model implementations and HF bridge logic) label and removed the model-qwen label Mar 11, 2026
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
…_unpacked.sh and sft_unpacked.sh

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
… in peft_unpacked.sh

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
…n README files

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
@kamran-nvidia kamran-nvidia changed the title from "Refactor and enhance Qwen3-VL and Ministral3 scripts for improved fin…" to "[model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts" Mar 12, 2026
@kamran-nvidia
Contributor Author

/ok to test 37d2baa

@kamran-nvidia
Contributor Author

/ok to test 4011f62

@kamran-nvidia kamran-nvidia marked this pull request as ready for review March 12, 2026 18:06
@kamran-nvidia kamran-nvidia requested a review from cuichenx March 12, 2026 18:06
@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

This PR updates VLM training configurations and framework across ministral3 and qwen3_vl models. Key changes include adjusting hyperparameters (training iterations, batch sizes, learning rates) in shell scripts, updating recipe references from finetune to SFT/PEFT configs, removing dataset_type support from the training framework, adding new Energon dataset support with corresponding configurations, and updating model references to Instruct variants.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation Updates**<br>`examples/models/vlm/ministral3/README.md`, `examples/models/vlm/qwen3_vl/README.md` | Added "Expected Training Dynamics" subsections with WandB report links; updated script references from sft.sh/peft.sh to unpacked/energon variants. |
| **Ministral3 Training Configuration**<br>`examples/models/vlm/ministral3/sft_unpacked.sh`, `examples/models/vlm/ministral3/peft_unpacked.sh` | Updated hyperparameters (TRAIN_ITERS: 50→100, GLOBAL_BATCH_SIZE: 32→16, EVAL_ITERS: 10→20, learning rates adjusted); changed recipe names to sft_config/peft_config; moved eval_iters from train to validation section; added dataset.pack_sequences_in_batch=False. |
| **Qwen3-VL Training Configuration**<br>`examples/models/vlm/qwen3_vl/peft.sh`, `examples/models/vlm/qwen3_vl/peft_unpacked.sh`, `examples/models/vlm/qwen3_vl/peft_energon.sh`, `examples/models/vlm/qwen3_vl/sft.sh`, `examples/models/vlm/qwen3_vl/sft_unpacked.sh` | Standardized hyperparameters across configs (TRAIN_ITERS: 50→100, batch size reductions, EVAL_ITERS: 10→20, added EVAL_INTERVAL: 20); updated recipe references to sft_config/peft_config; migrated evaluation parameters to validation section; updated parallelism configurations; removed MoE test blocks from sft scripts; added energon-specific recipe variant. |
| **Training Framework**<br>`scripts/training/run_recipe.py` | Removed dataset_type parameter from CLI arguments, load_recipe signature, and all dynamic kwargs construction; simplified configuration passing by eliminating conditional dataset_type handling. |
| **VLM Recipe Configuration**<br>`src/megatron/bridge/recipes/qwen_vl/__init__.py`, `src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` | Added new qwen3_vl_8b_peft_energon_config function with Energon dataset support via _make_energon_dataset helper; updated HF model references from standard to Instruct variants (8B, 30B-A3B, 235B-A22B); added alltoall token dispatcher for MoE configs; exposed new Energon config in public API. |
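The run_recipe.py change can be sketched with a toy argparse CLI. Everything here is illustrative — the recipe registry, `load_recipe` body, and argument set are stand-ins for the real script, which has a different interface:

```python
import argparse

# Hypothetical recipe registry; the real recipes are functions exported from
# src/megatron/bridge/recipes/qwen_vl/.
RECIPES = {
    "qwen3_vl_8b_peft_config": lambda: {"dataset": "hf_chat", "peft": True},
    "qwen3_vl_8b_peft_energon_config": lambda: {"dataset": "energon", "peft": True},
}


def load_recipe(name: str) -> dict:
    # No dataset_type kwarg and no conditional kwargs construction anymore:
    # the recipe name alone determines the dataset.
    return RECIPES[name]()


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--recipe", required=True, choices=sorted(RECIPES))
    # --dataset_type is intentionally absent after this PR.
    return parser.parse_args(argv)


args = parse_args(["--recipe", "qwen3_vl_8b_peft_energon_config"])
cfg = load_recipe(args.recipe)
```

Passing the removed `--dataset_type` flag to a parser like this fails fast with an "unrecognized arguments" error, which matches the test-plan item that run_recipe.py no longer accepts it.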

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • yaoyu-33
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the primary change: refactoring fine-tuning scripts for Qwen3-VL and Ministral3 models, which aligns with the major modifications across multiple script files and recipe configurations. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | PR includes W&B report links in modified READMEs documenting loss curves and grad norms for hyperparameter and parallelism changes, with PR objectives confirming completion of testing for major modifications. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/models/vlm/ministral3/README.md`:
- Around line 127-128: The README's "Expected Training Dynamics" section now
links an existing Weights & Biases report but older sentences still say "W&B
report coming soon" — remove or replace those stale lines (the instances
referencing "W&B report coming soon" near the W&B link and earlier in the file)
so the document consistently reflects that the W&B report exists; update any
surrounding text to reference the provided link (the W&B report URL and the
"Expected Training Dynamics" section) instead of the placeholder.
- Around line 104-110: Update the "Available recipes" section in
examples/models/vlm/ministral3/README.md to match the actual exports in
src/megatron/bridge/recipes/ministral3/__init__.py by replacing any occurrences
of ministral3_*_finetune_config with ministral3_*_sft_config and
ministral3_*_peft_config where appropriate, and ensure the README recipe
examples and script references (sft_unpacked.sh, peft_unpacked.sh) consistently
use the *_sft_config and *_peft_config names so users can load the recipes as
exported.
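The second review comment is a naming-drift problem: README recipe names falling out of sync with a module's exports. A sketch of a small doc-consistency check, using inline stand-in data — `readme_text` and `exported` are made up, not the real files:

```python
import re

# Stand-in for examples/models/vlm/ministral3/README.md content.
readme_text = """
Available recipes: `ministral3_24b_sft_config`, `ministral3_24b_peft_config`
Run with sft_unpacked.sh or peft_unpacked.sh.
"""

# Stand-in for the names exported from the recipes __init__.py.
exported = {"ministral3_24b_sft_config", "ministral3_24b_peft_config"}

# Collect every recipe-shaped name the README mentions, including the old
# *_finetune_config spelling so stale references are caught too.
referenced = set(
    re.findall(r"ministral3_\w+_(?:sft|peft|finetune)_config", readme_text)
)

# Names the README mentions but the package no longer exports.
stale = referenced - exported
```

Wired into CI against the real README and `__init__.py`, a check like this would have flagged the `*_finetune_config` references before review.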

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: de9f982b-b22c-4869-8ffc-e57c667e1a18

📥 Commits

Reviewing files that changed from the base of the PR and between 52eaf8c and 4011f62.

📒 Files selected for processing (12)
  • examples/models/vlm/ministral3/README.md
  • examples/models/vlm/ministral3/peft_unpacked.sh
  • examples/models/vlm/ministral3/sft_unpacked.sh
  • examples/models/vlm/qwen3_vl/README.md
  • examples/models/vlm/qwen3_vl/peft.sh
  • examples/models/vlm/qwen3_vl/peft_energon.sh
  • examples/models/vlm/qwen3_vl/peft_unpacked.sh
  • examples/models/vlm/qwen3_vl/sft.sh
  • examples/models/vlm/qwen3_vl/sft_unpacked.sh
  • scripts/training/run_recipe.py
  • src/megatron/bridge/recipes/qwen_vl/__init__.py
  • src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
💤 Files with no reviewable changes (1)
  • scripts/training/run_recipe.py

…n3 VL PEFT tests

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
@kamran-nvidia
Contributor Author

/ok to test b61857f

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
@kamran-nvidia
Contributor Author

/ok to test 2299ea6

Contributor

@cuichenx cuichenx left a comment


LGTM


Labels

area:model Model implementations and HF bridge logic
