
fix: Fix PEFT recompute hook#1762

Merged
yaoyu-33 merged 4 commits into main from feature/peft-recompute-hook on Dec 28, 2025
Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Dec 17, 2025

Summary

  • add a reusable recompute helper (with attribution) to ensure checkpointed transformer blocks see grad-requiring inputs during adapter-only finetuning
  • invoke the helper when PEFT transforms are applied so adapter-only runs no longer drop gradients
  • cover the helper with a focused unit test exercising the grad-enable path and duplicate patch guard
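
The fix described above boils down to a forward pre-hook plus a one-time-registration guard. The following is a minimal sketch of the idea using PyTorch's public hook API; the function and attribute names (`patch_recompute_inputs`, `_recompute_input_hook`) are illustrative, not Megatron-Bridge's actual identifiers:

```python
import torch
from torch import nn

def _force_grad_inputs(_module, args):
    # With all base weights frozen (adapter-only finetuning), a block's
    # inputs can arrive with requires_grad=False, so activation
    # recomputation has nothing to differentiate and adapter gradients
    # are silently dropped. Forcing grad on floating-point tensor inputs
    # keeps the recomputed graph connected to the adapter parameters.
    return tuple(
        a.requires_grad_(True)
        if torch.is_tensor(a) and a.is_floating_point() and not a.requires_grad
        else a
        for a in args
    )

def patch_recompute_inputs(module: nn.Module) -> None:
    # Duplicate-patch guard: register the hook at most once per module.
    if getattr(module, "_recompute_input_hook", False):
        return
    module.register_forward_pre_hook(_force_grad_inputs)
    module._recompute_input_hook = True
```

A unit test along the lines of the one mentioned above would freeze all parameters, patch twice, and check that exactly one hook fires and the output requires grad.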

@copy-pr-bot

copy-pr-bot bot commented Dec 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@yaoyu-33 yaoyu-33 changed the title from "Fix PEFT recompute hook" to "fix: Fix PEFT recompute hook" on Dec 17, 2025
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@HollowMan6
Contributor

HollowMan6 commented Dec 17, 2025

@yaoyu-33 Is it possible to apply the patching logic inside `__call__(self, model: ModelType, training: bool = True)` in https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/174298a35c87f2d0e14ef0d3cf91f1c030de70b6/src/megatron/bridge/peft/base.py directly? I just found that the current patching place seems to apply only to SFT (verl doesn't call the logic in src/megatron/bridge/training/setup.py); applying it there would benefit all downstream projects. #1750 (comment)

Reference: https://github.com/volcengine/verl/blob/16a6c4791c5f829c4a0c207ee3a086e90f855157/verl/utils/megatron_utils.py#L218-L222
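
The suggestion above can be sketched as follows. This is a hypothetical outline of a PEFT transform, not the actual Megatron-Bridge `PEFT` base class; the point is that installing the patch inside `__call__` means every caller that applies the transform (the SFT driver or an RL framework like verl) gets the fix automatically:

```python
import torch
from torch import nn

class AdapterTransform:
    """Hypothetical sketch of a PEFT transform whose __call__ installs the
    recompute patch itself, so downstream projects that never touch
    training/setup.py still benefit from it."""

    def transform(self, model: nn.Module) -> nn.Module:
        return model  # adapter injection elided in this sketch

    def __call__(self, model: nn.Module, training: bool = True) -> nn.Module:
        model = self.transform(model)
        if training and not getattr(model, "_recompute_patched", False):
            def force_grad(_m, args):
                # Make frozen-base inputs participate in autograd so
                # checkpointed blocks still recompute in backward.
                return tuple(
                    a.requires_grad_(True)
                    if torch.is_tensor(a) and a.is_floating_point()
                    and not a.requires_grad
                    else a
                    for a in args
                )
            model.register_forward_pre_hook(force_grad)
            model._recompute_patched = True  # guard against double patching
        return model
```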

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

@HollowMan6 updated.

@yaoyu-33
Contributor Author

/ok to test b90af9e

@erictang000

Confirmed that this is working for me!

[screenshot: training curves]

Pink is with the fix (blue is without activation checkpointing, purple is without LoRA).

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test 8eb93ef

@yaoyu-33 yaoyu-33 merged commit 953aabf into main Dec 28, 2025
56 of 58 checks passed
@yaoyu-33 yaoyu-33 deleted the feature/peft-recompute-hook branch December 28, 2025 05:22
erictang000 added a commit to NovaSky-AI/SkyRL that referenced this pull request Dec 28, 2025
Enables LoRA training with the Megatron Backend. Currently waiting for
NVIDIA-NeMo/Megatron-Bridge#1762 to be merged
into main, so we can at least pin a commit rather than a branch for
stability.

- Adds
[LoRA](https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html)
support via Megatron-Bridge
- Adds custom checkpointing for LoRA model parameters (until LoRA
checkpointing logic is upstreamed to Megatron-Bridge).
- Weight-syncing logic for Megatron + LoRA is handled by merging the
LoRA parameters back into the base model before exporting to vLLM. This
means that, for Megatron LoRA (for now), LoRA does not have to be
configured for vLLM.
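
The merge described in the last bullet is standard LoRA algebra, W' = W + (alpha/r)·B·A. A minimal sketch, with illustrative names rather than SkyRL's actual sync code:

```python
import torch

def merge_lora_into_base(base_weight: torch.Tensor,
                         lora_a: torch.Tensor,
                         lora_b: torch.Tensor,
                         alpha: float) -> torch.Tensor:
    # Shapes: base_weight (out, in); lora_a (r, in); lora_b (out, r).
    # Folding the adapter into the dense weight before export means the
    # inference engine (vLLM here) loads an ordinary checkpoint and
    # needs no LoRA configuration of its own.
    rank = lora_a.shape[0]
    return base_weight + (alpha / rank) * (lora_b @ lora_a)
```

The trade-off is that every sync ships full dense weights; syncing only the adapter tensors is noted under Future Work below as a follow-up.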

## Examples

GSM8K for Qwen3-30B-MoE and Qwen3-0.6B converging:
[screenshot: reward curves]

- Qwen3-30B-A3B previously required 2 H100 nodes for full-parameter
fine-tuning; with LoRA we can increase the batch size compared to
previous runs on just 1 H100 node!

### DAPO Qwen-4B
With TIS, the Megatron dense backend can match or exceed the FSDP
backend's performance. TIS is especially important for the current
version of LoRA. Canonical LoRA seems to perform worse than "performant
LoRA", or is perhaps more sensitive to learning rate.
[screenshot: DAPO training curves]


Blockers/TODOs:
- [x] ~~For Dense models, LoRA results in low grad norm/0 ppo_clip_ratio
unless pp > 1. Something on megatron-core or megatron-bridge is broken
for dense models.~~ Issue tracked on Megatron-Bridge
(NVIDIA-NeMo/Megatron-Bridge#1750), awaiting
PR NVIDIA-NeMo/Megatron-Bridge#1762
- [x] Test out MoE models

## Future Work
- Once Megatron-Bridge support for exporting only the LoRA parameters
lands, we should sync just those to vLLM for lower communication cost
- Add support for other LoRA variants from Megatron-Bridge (canonical
LoRA, QLoRA, DoRA).
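
The adapter-only sync mentioned under Future Work amounts to filtering the state dict down to adapter tensors before transfer. A sketch under the assumption that adapter parameter names contain "lora" (Megatron-Bridge's actual naming may differ):

```python
def lora_only_state_dict(state_dict: dict) -> dict:
    # Shipping only the adapter tensors (typically well under 1% of
    # total parameters) makes weight sync to the inference engine far
    # cheaper than re-sending the full merged model every step.
    return {k: v for k, v in state_dict.items() if "lora" in k.lower()}
```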
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026