
Add refactored SFT, PEFT recipes for VLMs #2614

Merged

athitten merged 11 commits into main from athitten/vlm_recipe_refactor on Mar 4, 2026
Conversation

@athitten
Contributor

@athitten athitten commented Mar 2, 2026

What does this PR do ?

Adds the new parameterless-API recipes for VLM models, similar to #2067 and #2268. Replaces the existing finetuning config files with separate *_sft_config() and *_peft_config() functions for each model variant.
Deletes the pretrain_config() for VLMs and keeps only the finetuning recipes.

Changelog

  • Add specific, line-by-line info on the high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Introduced simplified configuration functions for Vision-Language Model fine-tuning and parameter-efficient training across Gemma3-VL, GLM-4.5V, Ministral3, Nemotron Nano V2, Qwen2.5-VL, and Qwen3-VL models.
  • Refactor

    • Streamlined configuration setup by replacing complex parameterized systems with explicit, parameterless configuration entry points for each model size and training approach.
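As a rough illustration of the pattern described above, a parameterless entry point just builds on a shared base config and applies explicit per-model overrides. The names below (the `ConfigContainer` fields, `_sft_common_vlm`, `example_vlm_8b_sft_config`) are simplified stand-ins for illustration, not the actual megatron.bridge API:

```python
from dataclasses import dataclass


# Simplified stand-in for the real ConfigContainer, which has many more fields.
@dataclass
class ConfigContainer:
    seq_length: int = 2048
    tensor_model_parallel_size: int = 1
    max_lr: float = 1e-4


def _sft_common_vlm() -> ConfigContainer:
    # Plays the role of the shared helper in recipes/common.py:
    # VLM-wide SFT defaults live in one place.
    return ConfigContainer(seq_length=4096, max_lr=2e-5)


def example_vlm_8b_sft_config() -> ConfigContainer:
    """Parameterless entry point: no kwargs, every choice is explicit."""
    cfg = _sft_common_vlm()
    cfg.tensor_model_parallel_size = 2  # per-model-size override
    return cfg


print(example_vlm_8b_sft_config().tensor_model_parallel_size)  # → 2
```

Compared with a kwargs-heavy builder, each model size gets its own small, readable function, and overrides are visible at the call site.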

athitten added 5 commits March 2, 2026 14:17
Signed-off-by: Abhishree <abhishreetm@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@athitten athitten requested a review from yaoyu-33 March 2, 2026 22:26
@coderabbitai
Contributor

coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Walkthrough

The PR refactors VLM finetuning recipes from parameterized configuration systems to a parameterless API with explicit factory functions. It adds shared VLM configuration helpers in common.py and updates multiple model recipes (Gemma3-VL, GLM-4.5V, Ministral3, Nemotron-VL, Qwen2.5-VL, Qwen3-VL) to use these helpers and provide discrete, per-model-size configuration entry points for SFT and PEFT training modes.

Changes

Cohort / File(s) / Summary

  • Common VLM Configuration Helpers (src/megatron/bridge/recipes/common.py): Introduces two new VLM-specific base configuration functions, _sft_common_vlm() and _peft_common_vlm(), extending the existing SFT/PEFT templates with VLM-specific defaults (seq_length, training iterations, batch sizes), HFDatasetConversationProvider, NullTokenizer configuration, and learning rates for adapter training.
  • Gemma3-VL Recipe Refactor (src/megatron/bridge/recipes/gemma3_vl/gemma3_vl.py): Replaces parameterized configuration builders with six discrete factory functions for the 4B/12B/27B models, covering both SFT and PEFT modes. Removes large TypedDicts and monolithic helpers in favor of explicit, per-model-size configuration entry points using the shared VLM helpers.
  • GLM-4.5V Recipe Refactor (src/megatron/bridge/recipes/glm_vl/glm_45v.py): Refactors from a parameter-heavy configuration to two parameterless functions, glm_45v_sft_config() and glm_45v_peft_config(), leveraging the common VLM scaffolding with AutoBridge-based model loading and MoE/VLM-specific settings.
  • Ministral3 Recipe Refactor (src/megatron/bridge/recipes/ministral3/ministral3.py): Replaces the flexible finetune interface with six explicit config factory functions for the 3B/8B/14B models in SFT and PEFT modes, removing the legacy TypedDict-based option handling and using the common VLM helpers for base configurations.
  • Nemotron-VL Recipe Refactor (src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py): Replaces the monolithic pretrain configuration with two explicit entry points, nemotron_nano_v2_vl_12b_sft_config() and nemotron_nano_v2_vl_12b_peft_config(), consolidating the SFT/PEFT training flows using the common VLM helpers.
  • Qwen2.5-VL Recipe Refactor (src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py): Introduces eight parameterless factory functions for the 3B/7B/32B/72B models in SFT and PEFT modes, eliminating the dynamic kwargs-based configuration paths and consolidating model initialization through static, self-contained config builders.
  • Qwen3-VL Recipe Refactor (src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py): Replaces the parameterized common configuration scaffolding with six discrete factory functions for the 8B/30B-A3B/235B-A22B models in SFT and PEFT modes, using AutoBridge model loading and the common VLM helpers for configuration assembly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

Run CICD

Suggested reviewers

  • cuichenx
  • ko3n1g
  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning. The PR contains major refactoring across 7 VLM recipe files (~3,700+ net lines changed) with identified bugs in the learning rate configs (min_lr > max_lr), but no test results or convergence validation are documented. Resolution: provide documented test results validating training convergence and performance; fix the identified learning rate bugs; include benchmark results demonstrating no regression.

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title directly and clearly summarizes the main change: introducing refactored SFT and PEFT recipes for vision-language models. It accurately reflects the primary objective shown in the PR objectives and file-level summaries.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py (1)

224-232: Consider using a higher learning rate for PEFT.

Both SFT and PEFT configs use the same max_lr=2e-5. Typically, PEFT training benefits from a higher learning rate (e.g., 1e-4) since only adapter weights are being trained. The base _peft_common_vlm() uses max_lr=1e-4 for this reason.

Is this intentional for Nemotron, or should the PEFT config use a higher LR?
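For illustration, the distinction the reviewer is drawing can be reduced to a tiny sketch; the constants mirror the values quoted above, and the helper function name is hypothetical:

```python
# Full SFT updates every weight, so recipes keep the peak LR small; LoRA-style
# PEFT updates only adapter weights and typically tolerates a larger peak LR.
SFT_MAX_LR = 2e-5   # value both Nemotron configs currently use
PEFT_MAX_LR = 1e-4  # default the reviewer cites from _peft_common_vlm()


def choose_max_lr(mode: str) -> float:
    """Illustrative helper: pick a peak LR by training mode."""
    return {"sft": SFT_MAX_LR, "peft": PEFT_MAX_LR}[mode]


print(choose_max_lr("peft"))  # → 0.0001
```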

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py` around lines
224 - 232, The PEFT optimizer block currently sets max_lr=2e-5 which matches SFT
but differs from the expected PEFT LR used in _peft_common_vlm(); update the
call to distributed_fused_adam_with_cosine_annealing in nemotron_nano_v2_vl.py
(the block that assigns cfg.optimizer and cfg.scheduler) to use a higher max_lr
(e.g., 1e-4) consistent with PEFT, or fetch the value used by _peft_common_vlm()
and apply it here so PEFT trains adapters with the intended learning rate.
src/megatron/bridge/recipes/common.py (1)

444-546: LGTM - consistent PEFT VLM base builder.

The _peft_common_vlm() function mirrors the structure of _sft_common_vlm() with appropriate PEFT-specific defaults (higher LR for adapter training). Keeping the LoRA config from _peft_common() is a good choice for consistency.

Consider extracting the shared VLM configuration overrides (output directories, seq_length, training/validation config, DDP, dataset, tokenizer, logger, checkpoint, RNG) into a helper function to reduce duplication between _sft_common_vlm() and _peft_common_vlm(). However, this is a minor nit since the current explicit approach improves readability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/common.py` around lines 444 - 546, The code
duplicates VLM-specific overrides between _peft_common_vlm() and
_sft_common_vlm(); extract the shared logic into a helper (e.g.,
_apply_common_vlm_overrides(cfg, base_output_dir=None)) and call it from both
builders. Move the output directory setup, seq_length, train/validation
settings, DDP config, dataset initialization (hf_processor_path left as None),
NullTokenizer config, logger, checkpoint settings, and RNG seed into the helper,
while keeping PEFT-specific changes (optimizer/scheduler, LoRA) inside
_peft_common_vlm() and SFT-specific changes inside _sft_common_vlm(); update
both functions to start from their base config, call the helper, then apply
their remaining overrides.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0034dda and 399a7cd.

📒 Files selected for processing (7)
  • src/megatron/bridge/recipes/common.py
  • src/megatron/bridge/recipes/gemma3_vl/gemma3_vl.py
  • src/megatron/bridge/recipes/glm_vl/glm_45v.py
  • src/megatron/bridge/recipes/ministral3/ministral3.py
  • src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py
  • src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py
  • src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py

Comment on lines +144 to +155
def nemotron_nano_v2_vl_12b_peft_config(peft_scheme: str | PEFT = "lora") -> ConfigContainer:
    """Return a PEFT config for Nemotron Nano V2 VL 12B.

    Default configuration: 1 node, 8 GPUs
    - TP=2, PP=1
    - LR=5e-5 (PEFT)
    - Sequence length: 4096

    Args:
        peft_scheme: PEFT scheme - "lora", "dora", or a custom PEFT instance.
            Note: Default uses VLMLoRA targeting all model components.
    """

⚠️ Potential issue | 🟡 Minor

Docstring LR value doesn't match implementation.

The docstring states "LR=5e-5 (PEFT)" but the actual optimizer configuration at line 228 uses max_lr=2e-5. Please update the docstring to reflect the actual learning rate.

📝 Suggested docstring fix
 def nemotron_nano_v2_vl_12b_peft_config(peft_scheme: str | PEFT = "lora") -> ConfigContainer:
     """Return a PEFT config for Nemotron Nano V2 VL 12B.
 
     Default configuration: 1 node, 8 GPUs
     - TP=2, PP=1
-    - LR=5e-5 (PEFT)
+    - LR=2e-5 (PEFT)
     - Sequence length: 4096
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py` around lines
144 - 155, The docstring for nemotron_nano_v2_vl_12b_peft_config incorrectly
states "LR=5e-5 (PEFT)"; update that docstring to match the actual optimizer
setting used in the function (max_lr=2e-5) so the documentation reflects the
implemented learning rate.

Comment on lines +92 to +98
    # Optimizer - lower LR for full SFT
    opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
        lr_warmup_iters=500,
        lr_decay_iters=300000,
        max_lr=5e-6,
        min_lr=3e-5,
    )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge):

# First, read the qwen25_vl.py file to confirm the values
cat -n src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py | head -100

# Check all SFT config sections for learning rate values
rg "max_lr|min_lr" src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py -B 2 -A 2

# Search for other recipe files using cosine_annealing to find patterns
find src/megatron/bridge/recipes -name "*.py" -type f | xargs grep -l "cosine_annealing" | head -5

# Check gemma3_vl.py for SFT learning rate pattern
rg "max_lr|min_lr" src/megatron/bridge/recipes/gemma3_vl/gemma3_vl.py -B 2 -A 2 | head -50

# Check the cosine_annealing implementation to understand behavior
cat -n src/megatron/bridge/recipes/utils/optimizer_utils.py | grep -A 30 "def distributed_fused_adam_with_cosine_annealing"

# Check other recipes for pattern consistency
rg "max_lr.*5e-6" src/megatron/bridge/recipes -B 2 -A 2


Bug confirmed: min_lr (3e-5) is greater than max_lr (5e-6) in all four Qwen2.5-VL SFT configurations.

In cosine annealing, the learning rate should decay from max_lr to min_lr. With min_lr > max_lr, the schedule inverts and the learning rate will increase rather than decrease during training. Reference recipes like GLM-VL and Gemma3-VL consistently use min_lr = 0.1 * max_lr. For SFT with max_lr=5e-6, the correct min_lr should be 5e-7.

This issue affects the 3B, 7B, 32B, and 72B SFT configurations (lines 92–98, 200–206, 308–314, and 416–422).

Proposed fix
     opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
         lr_warmup_iters=500,
         lr_decay_iters=300000,
         max_lr=5e-6,
-        min_lr=3e-5,
+        min_lr=5e-7,
     )

Apply to all four SFT config functions.
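To see why the inverted bounds matter, here is a minimal cosine-annealing sketch (an illustration, not the actual optimizer_utils implementation): the schedule only decays when max_lr > min_lr, and with the reversed pair it climbs instead.

```python
import math


def cosine_annealed_lr(step, warmup, decay, max_lr, min_lr):
    """Cosine schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    t = min((step - warmup) / (decay - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))


# With max_lr=5e-6, min_lr=5e-7 the LR decays from 5e-6 down to 5e-7...
good = [cosine_annealed_lr(s, 500, 300_000, 5e-6, 5e-7) for s in (500, 300_000)]
# ...whereas the inverted pair (max_lr=5e-6, min_lr=3e-5) makes it climb.
bad = [cosine_annealed_lr(s, 500, 300_000, 5e-6, 3e-5) for s in (500, 300_000)]
print(good[0] > good[-1], bad[0] < bad[-1])  # → True True
```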

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py` around lines 92 - 98, The
cosine annealing call distributed_fused_adam_with_cosine_annealing currently
sets max_lr=5e-6 and min_lr=3e-5 (inverted); update each SFT config that calls
distributed_fused_adam_with_cosine_annealing (the four Qwen2.5-VL SFT functions
for 3B, 7B, 32B, 72B) so min_lr = 0.1 * max_lr (i.e., change min_lr to 5e-7 when
max_lr=5e-6) to ensure the LR decays rather than increases.

Comment on lines +387 to +393
    # Optimizer - lower LR for full SFT
    opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
        lr_warmup_iters=500,
        lr_decay_iters=300000,
        max_lr=5e-6,
        min_lr=3e-5,
    )

⚠️ Potential issue | 🟠 Major

Bug: min_lr is greater than max_lr in 235B-A22B SFT config.

Same issue as Qwen2.5-VL: the schedule has max_lr=5e-6 but min_lr=3e-5. This will cause the LR to increase over training instead of decreasing.

Note: The 8B and 30B-A3B SFT configs are correct (max_lr=5e-5, min_lr=5e-6).

🐛 Proposed fix
     opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
         lr_warmup_iters=500,
         lr_decay_iters=300000,
         max_lr=5e-6,
-        min_lr=3e-5,
+        min_lr=5e-7,  # 10% of max_lr
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` around lines 387 - 393, The
learning-rate bounds are reversed in the SFT optimizer config:
distributed_fused_adam_with_cosine_annealing is called with max_lr=5e-6 and
min_lr=3e-5, which will increase LR over training; swap or correct these values
so max_lr > min_lr (e.g., set max_lr to 3e-5 and min_lr to 5e-6 or use the
intended values consistent with other configs), updating the call site of
distributed_fused_adam_with_cosine_annealing in qwen3_vl.py to use the correct
max_lr and min_lr.

        lr_warmup_iters=500,
        lr_decay_iters=300000,
        max_lr=5e-6,
        min_lr=3e-5,
Contributor Author

Using train_iters, gbs, mbs, eval_iters, max_lr, min_lr, and lr_warmup_iters from the existing config, since the example does not contain a config for 235B.

# =============================================================================
# Qwen3-VL 8B SFT Configuration
# =============================================================================
def qwen3_vl_8b_sft_config() -> ConfigContainer:
Contributor Author

Using all values from the existing config, since the examples here use the config values from the recipes.

@athitten
Contributor Author

athitten commented Mar 3, 2026

/ok to test 399a7cd

@athitten
Contributor Author

athitten commented Mar 3, 2026

/ok to test 6b110b5

@athitten
Contributor Author

athitten commented Mar 3, 2026

/ok to test 028a0e8

@athitten
Contributor Author

athitten commented Mar 4, 2026

/ok to test 190479e

yaoyu-33
yaoyu-33 previously approved these changes Mar 4, 2026
@athitten
Contributor Author

athitten commented Mar 4, 2026

/ok to test a32b7f4
