Add refactored SFT, PEFT recipes for VLMs #2614
Conversation
Signed-off-by: Abhishree <abhishreetm@gmail.com>
📝 Walkthrough

The PR refactors VLM finetuning recipes from parameterized configuration systems to a parameterless API with explicit factory functions. It adds shared VLM configuration helpers in `src/megatron/bridge/recipes/common.py`.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py (1)
224-232: Consider using a higher learning rate for PEFT.

Both SFT and PEFT configs use the same `max_lr=2e-5`. Typically, PEFT training benefits from a higher learning rate (e.g., 1e-4) since only adapter weights are being trained. The base `_peft_common_vlm()` uses `max_lr=1e-4` for this reason. Is this intentional for Nemotron, or should the PEFT config use a higher LR?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py` around lines 224 - 232: The PEFT optimizer block currently sets max_lr=2e-5, which matches SFT but differs from the expected PEFT LR used in _peft_common_vlm(); update the call to distributed_fused_adam_with_cosine_annealing in nemotron_nano_v2_vl.py (the block that assigns cfg.optimizer and cfg.scheduler) to use a higher max_lr (e.g., 1e-4) consistent with PEFT, or fetch the value used by _peft_common_vlm() and apply it here so PEFT trains adapters with the intended learning rate.

src/megatron/bridge/recipes/common.py (1)
444-546: LGTM - consistent PEFT VLM base builder.

The `_peft_common_vlm()` function mirrors the structure of `_sft_common_vlm()` with appropriate PEFT-specific defaults (higher LR for adapter training). Keeping the LoRA config from `_peft_common()` is a good choice for consistency.

Consider extracting the shared VLM configuration overrides (output directories, seq_length, training/validation config, DDP, dataset, tokenizer, logger, checkpoint, RNG) into a helper function to reduce duplication between `_sft_common_vlm()` and `_peft_common_vlm()`. However, this is a minor nit since the current explicit approach improves readability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/common.py` around lines 444 - 546, The code duplicates VLM-specific overrides between _peft_common_vlm() and _sft_common_vlm(); extract the shared logic into a helper (e.g., _apply_common_vlm_overrides(cfg, base_output_dir=None)) and call it from both builders. Move the output directory setup, seq_length, train/validation settings, DDP config, dataset initialization (hf_processor_path left as None), NullTokenizer config, logger, checkpoint settings, and RNG seed into the helper, while keeping PEFT-specific changes (optimizer/scheduler, LoRA) inside _peft_common_vlm() and SFT-specific changes inside _sft_common_vlm(); update both functions to start from their base config, call the helper, then apply their remaining overrides.
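One hedged sketch of what that extraction could look like (all names, fields, and default values below are illustrative stand-ins; the real builders assemble a full `ConfigContainer`, not a plain dataclass):

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-in for ConfigContainer, just to sketch the refactor shape.
@dataclass
class Cfg:
    output_dir: str = ""
    seq_length: int = 0
    seed: int = 0
    optimizer: dict = field(default_factory=dict)


def _apply_common_vlm_overrides(cfg: Cfg, base_output_dir: Optional[str] = None) -> Cfg:
    # Shared VLM overrides that both builders currently duplicate
    # (output dir, seq_length, RNG seed, etc.; the values here are illustrative).
    cfg.output_dir = base_output_dir or "/results/vlm"
    cfg.seq_length = 4096
    cfg.seed = 1234
    return cfg


def sft_common_vlm_sketch() -> Cfg:
    cfg = _apply_common_vlm_overrides(Cfg())
    cfg.optimizer = {"max_lr": 2e-5}  # SFT-specific: lower LR for full finetuning
    return cfg


def peft_common_vlm_sketch() -> Cfg:
    cfg = _apply_common_vlm_overrides(Cfg())
    cfg.optimizer = {"max_lr": 1e-4}  # PEFT-specific: higher LR for adapter-only training
    return cfg
```

Each builder starts from its base config, applies the shared helper, then layers on only its own overrides, which keeps the SFT/PEFT difference visible at a glance.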
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- src/megatron/bridge/recipes/common.py
- src/megatron/bridge/recipes/gemma3_vl/gemma3_vl.py
- src/megatron/bridge/recipes/glm_vl/glm_45v.py
- src/megatron/bridge/recipes/ministral3/ministral3.py
- src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py
- src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py
- src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
```python
def nemotron_nano_v2_vl_12b_peft_config(peft_scheme: str | PEFT = "lora") -> ConfigContainer:
    """Return a PEFT config for Nemotron Nano V2 VL 12B.

    Default configuration: 1 node, 8 GPUs
    - TP=2, PP=1
    - LR=5e-5 (PEFT)
    - Sequence length: 4096

    Args:
        peft_scheme: PEFT scheme - "lora", "dora", or a custom PEFT instance.
            Note: Default uses VLMLoRA targeting all model components.
    """
```
Docstring LR value doesn't match implementation.
The docstring states "LR=5e-5 (PEFT)" but the actual optimizer configuration at line 228 uses max_lr=2e-5. Please update the docstring to reflect the actual learning rate.
📝 Suggested docstring fix

```diff
 def nemotron_nano_v2_vl_12b_peft_config(peft_scheme: str | PEFT = "lora") -> ConfigContainer:
     """Return a PEFT config for Nemotron Nano V2 VL 12B.

     Default configuration: 1 node, 8 GPUs
     - TP=2, PP=1
-    - LR=5e-5 (PEFT)
+    - LR=2e-5 (PEFT)
     - Sequence length: 4096
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/nemotron_vl/nemotron_nano_v2_vl.py` around lines
144 - 155, The docstring for nemotron_nano_v2_vl_12b_peft_config incorrectly
states "LR=5e-5 (PEFT)"; update that docstring to match the actual optimizer
setting used in the function (max_lr=2e-5) so the documentation reflects the
implemented learning rate.
```python
# Optimizer - lower LR for full SFT
opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
    lr_warmup_iters=500,
    lr_decay_iters=300000,
    max_lr=5e-6,
    min_lr=3e-5,
)
```
🧩 Analysis chain

Repository: NVIDIA-NeMo/Megatron-Bridge

🏁 Script executed:

```shell
# First, read the qwen25_vl.py file to confirm the values
cat -n src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py | head -100
```

🏁 Script executed:

```shell
# Check all SFT config sections for learning rate values
rg "max_lr|min_lr" src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py -B 2 -A 2
```

🏁 Script executed:

```shell
# Search for other recipe files using cosine_annealing to find patterns
find src/megatron/bridge/recipes -name "*.py" -type f | xargs grep -l "cosine_annealing" | head -5
```

🏁 Script executed:

```shell
# Check gemma3_vl.py for SFT learning rate pattern
rg "max_lr|min_lr" src/megatron/bridge/recipes/gemma3_vl/gemma3_vl.py -B 2 -A 2 | head -50
```

🏁 Script executed:

```shell
# Check the cosine_annealing implementation to understand behavior
cat -n src/megatron/bridge/recipes/utils/optimizer_utils.py | grep -A 30 "def distributed_fused_adam_with_cosine_annealing"
```

🏁 Script executed:

```shell
# Check other recipes for pattern consistency
rg "max_lr.*5e-6" src/megatron/bridge/recipes -B 2 -A 2
```
Bug confirmed: min_lr (3e-5) is greater than max_lr (5e-6) in all four Qwen2.5-VL SFT configurations.
In cosine annealing, the learning rate should decay from max_lr to min_lr. With min_lr > max_lr, the schedule inverts and the learning rate will increase rather than decrease during training. Reference recipes like GLM-VL and Gemma3-VL consistently use min_lr = 0.1 * max_lr. For SFT with max_lr=5e-6, the correct min_lr should be 5e-7.
This issue affects the 3B, 7B, 32B, and 72B SFT configurations (lines 92–98, 200–206, 308–314, and 416–422).
Proposed fix

```diff
 opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
     lr_warmup_iters=500,
     lr_decay_iters=300000,
     max_lr=5e-6,
-    min_lr=3e-5,
+    min_lr=5e-7,
 )
```

Apply to all four SFT config functions.
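The effect of the inverted bounds is easy to demonstrate with a minimal stand-in for the schedule (this is the textbook cosine-annealing formula, not the library's actual implementation):

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, max_lr: float, min_lr: float) -> float:
    # Standard cosine annealing: interpolates from max_lr at step 0
    # down to min_lr at the final step.
    frac = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (max_lr - min_lr) * frac


# Buggy bounds (min_lr > max_lr): the "decay" runs upward,
# from about 5e-6 at step 0 to 3e-5 at the end.
print(cosine_annealed_lr(0, 300000, max_lr=5e-6, min_lr=3e-5))
print(cosine_annealed_lr(300000, 300000, max_lr=5e-6, min_lr=3e-5))

# Proposed fix (min_lr = 0.1 * max_lr): the LR decays to 5e-7 as intended.
print(cosine_annealed_lr(300000, 300000, max_lr=5e-6, min_lr=5e-7))
```

Nothing in the formula enforces max_lr > min_lr, so the inverted config silently anneals upward instead of failing fast.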
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/qwen_vl/qwen25_vl.py` around lines 92 - 98, The
cosine annealing call distributed_fused_adam_with_cosine_annealing currently
sets max_lr=5e-6 and min_lr=3e-5 (inverted); update each SFT config that calls
distributed_fused_adam_with_cosine_annealing (the four Qwen2.5-VL SFT functions
for 3B, 7B, 32B, 72B) so min_lr = 0.1 * max_lr (i.e., change min_lr to 5e-7 when
max_lr=5e-6) to ensure the LR decays rather than increases.
```python
# Optimizer - lower LR for full SFT
opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
    lr_warmup_iters=500,
    lr_decay_iters=300000,
    max_lr=5e-6,
    min_lr=3e-5,
)
```
Bug: min_lr is greater than max_lr in 235B-A22B SFT config.
Same issue as Qwen2.5-VL: the schedule has max_lr=5e-6 but min_lr=3e-5. This will cause the LR to increase over training instead of decreasing.
Note: The 8B and 30B-A3B SFT configs are correct (max_lr=5e-5, min_lr=5e-6).
🐛 Proposed fix

```diff
 opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
     lr_warmup_iters=500,
     lr_decay_iters=300000,
     max_lr=5e-6,
-    min_lr=3e-5,
+    min_lr=5e-7,  # 10% of max_lr
 )
```

🤖 Prompt for AI Agents
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` around lines 387 - 393, The
learning-rate bounds are reversed in the SFT optimizer config:
distributed_fused_adam_with_cosine_annealing is called with max_lr=5e-6 and
min_lr=3e-5, which will increase LR over training; swap or correct these values
so max_lr > min_lr (e.g., set max_lr to 3e-5 and min_lr to 5e-6 or use the
intended values consistent with other configs), updating the call site of
distributed_fused_adam_with_cosine_annealing in qwen3_vl.py to use the correct
max_lr and min_lr.
```python
    lr_warmup_iters=500,
    lr_decay_iters=300000,
    max_lr=5e-6,
    min_lr=3e-5,
```
Using `train_iters`, `gbs`, `mbs`, `eval_iters`, `max_lr`, `min_lr`, and `lr_warmup_iters` from the existing config, since the example does not contain a config for 235B.
```python
# =============================================================================
# Qwen3-VL 8B SFT Configuration
# =============================================================================
def qwen3_vl_8b_sft_config() -> ConfigContainer:
```
Using all values from the existing config, since the examples here use the config values from the recipes.
/ok to test 399a7cd

/ok to test 6b110b5

/ok to test 028a0e8

/ok to test 190479e

/ok to test a32b7f4
What does this PR do?

Adds the new parameterless API recipes for VLM models, similar to #2067 and #2268. Replaces the existing finetuning config files with separate `*_sft_config()` and `*_peft_config()` functions for each variant.
Deletes the pretrain_config() for VLMs and only keeps the finetuning recipes.
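As a rough illustration of what the parameterless style looks like (the function name and fields below are hypothetical stand-ins, not the actual recipe API, which returns a full `ConfigContainer`):

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    # Illustrative fields only; the real recipes carry many more settings.
    tensor_model_parallel_size: int
    pipeline_model_parallel_size: int
    max_lr: float
    seq_length: int


def example_vlm_sft_config() -> ConfigSketch:
    # Parameterless factory: every default is explicit in the function body,
    # so there is no parameterized config system to thread arguments through.
    return ConfigSketch(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
        max_lr=5e-6,
        seq_length=4096,
    )


# Callers override by mutating the returned config instead of passing kwargs:
cfg = example_vlm_sft_config()
cfg.max_lr = 1e-5
```

The trade-off is that each variant repeats its defaults explicitly, but every recipe is self-documenting and greppable.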
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information