
fix: Fix PEFT recompute hook#1762

Merged
yaoyu-33 merged 4 commits into main from feature/peft-recompute-hook on Dec 28, 2025
Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Dec 17, 2025

Summary

  • add a reusable recompute helper (with attribution) to ensure checkpointed transformer blocks see grad-requiring inputs during adapter-only finetuning
  • invoke the helper when PEFT transforms are applied so adapter-only runs no longer drop gradients
  • cover the helper with a focused unit test exercising the grad-enable path and duplicate patch guard
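
The fix described above boils down to a forward pre-hook plus a one-time-registration guard. The following is a minimal sketch of the idea using PyTorch's public hook API; the function and attribute names (`patch_recompute_inputs`, `_recompute_input_hook`) are illustrative, not Megatron-Bridge's actual identifiers:

```python
import torch
from torch import nn

def _force_grad_inputs(_module, args):
    # With all base weights frozen (adapter-only finetuning), a block's
    # inputs can arrive with requires_grad=False, so activation
    # recomputation has nothing to differentiate and adapter gradients
    # are silently dropped. Forcing grad on floating-point tensor inputs
    # keeps the recomputed graph connected to the adapter parameters.
    return tuple(
        a.requires_grad_(True)
        if torch.is_tensor(a) and a.is_floating_point() and not a.requires_grad
        else a
        for a in args
    )

def patch_recompute_inputs(module: nn.Module) -> None:
    # Duplicate-patch guard: register the hook at most once per module.
    if getattr(module, "_recompute_input_hook", False):
        return
    module.register_forward_pre_hook(_force_grad_inputs)
    module._recompute_input_hook = True
```

A unit test along the lines of the one mentioned above would freeze all parameters, patch twice, and check that exactly one hook fires and the output requires grad.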

@copy-pr-bot

copy-pr-bot bot commented Dec 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@yaoyu-33 yaoyu-33 changed the title from "Fix PEFT recompute hook" to "fix: Fix PEFT recompute hook" on Dec 17, 2025
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@HollowMan6
Contributor

HollowMan6 commented Dec 17, 2025

@yaoyu-33 Is it possible to apply the patching logic inside `__call__(self, model: ModelType, training: bool = True)` in https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/174298a35c87f2d0e14ef0d3cf91f1c030de70b6/src/megatron/bridge/peft/base.py directly? I just found that the current patching place seems to apply only to SFT (verl doesn't call the logic in src/megatron/bridge/training/setup.py); applying it there would benefit all downstream projects. #1750 (comment)

Reference: https://github.com/volcengine/verl/blob/16a6c4791c5f829c4a0c207ee3a086e90f855157/verl/utils/megatron_utils.py#L218-L222
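
The suggestion above can be sketched as follows. This is a hypothetical outline of a PEFT transform, not the actual Megatron-Bridge `PEFT` base class; the point is that installing the patch inside `__call__` means every caller that applies the transform (the SFT driver or an RL framework like verl) gets the fix automatically:

```python
import torch
from torch import nn

class AdapterTransform:
    """Hypothetical sketch of a PEFT transform whose __call__ installs the
    recompute patch itself, so downstream projects that never touch
    training/setup.py still benefit from it."""

    def transform(self, model: nn.Module) -> nn.Module:
        return model  # adapter injection elided in this sketch

    def __call__(self, model: nn.Module, training: bool = True) -> nn.Module:
        model = self.transform(model)
        if training and not getattr(model, "_recompute_patched", False):
            def force_grad(_m, args):
                # Make frozen-base inputs participate in autograd so
                # checkpointed blocks still recompute in backward.
                return tuple(
                    a.requires_grad_(True)
                    if torch.is_tensor(a) and a.is_floating_point()
                    and not a.requires_grad
                    else a
                    for a in args
                )
            model.register_forward_pre_hook(force_grad)
            model._recompute_patched = True  # guard against double patching
        return model
```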

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

@HollowMan6 updated.

@yaoyu-33
Contributor Author

/ok to test b90af9e

@erictang000

Confirmed that this is working for me!

[screenshot: training curves]

Pink is with the fix (blue is without activation checkpointing, purple is without LoRA).

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test 8eb93ef

@yaoyu-33 yaoyu-33 merged commit 953aabf into main Dec 28, 2025
56 of 58 checks passed
@yaoyu-33 yaoyu-33 deleted the feature/peft-recompute-hook branch December 28, 2025 05:22
erictang000 added a commit to NovaSky-AI/SkyRL that referenced this pull request Dec 28, 2025
Enables LoRA training with the Megatron Backend. Currently waiting for
NVIDIA-NeMo/Megatron-Bridge#1762 to be merged
into main, so we can at least pin a commit rather than a branch for
stability.

- Adds
[LoRA](https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html)
support via Megatron-Bridge
- Adds custom checkpointing for LoRA model parameters (until LoRA
checkpointing logic is upstreamed to Megatron-Bridge).
- Weight-syncing logic for Megatron + LoRA is handled by merging the
LoRA parameters back into the base model before exporting to vLLM. This
means that, for Megatron LoRA (for now), LoRA does not have to be
configured for vLLM.
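
The merge described in the last bullet is standard LoRA algebra, W' = W + (alpha/r)·B·A. A minimal sketch, with illustrative names rather than SkyRL's actual sync code:

```python
import torch

def merge_lora_into_base(base_weight: torch.Tensor,
                         lora_a: torch.Tensor,
                         lora_b: torch.Tensor,
                         alpha: float) -> torch.Tensor:
    # Shapes: base_weight (out, in); lora_a (r, in); lora_b (out, r).
    # Folding the adapter into the dense weight before export means the
    # inference engine (vLLM here) loads an ordinary checkpoint and
    # needs no LoRA configuration of its own.
    rank = lora_a.shape[0]
    return base_weight + (alpha / rank) * (lora_b @ lora_a)
```

The trade-off is that every sync ships full dense weights; syncing only the adapter tensors is noted under Future Work below as a follow-up.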

## Examples

GSM8K for Qwen3-30B-MoE and Qwen3-0.6B converging:
[screenshot: reward curves]

- Qwen3-30B-A3B previously required 2 H100 nodes for full-parameter
fine-tuning; with LoRA we can increase the batch size compared to
previous runs on just 1 H100 node!

### DAPO Qwen-4B
With TIS, the Megatron dense backend can match or exceed the FSDP
backend's performance. TIS is especially important for the current
version of LoRA. Canonical LoRA seems to perform worse than "performant
LoRA", or is perhaps more sensitive to learning rate.
[screenshot: DAPO training curves]


Blockers/TODOs:
- [x] ~~For Dense models, LoRA results in low grad norm/0 ppo_clip_ratio
unless pp > 1. Something on megatron-core or megatron-bridge is broken
for dense models.~~ Issue tracked on Megatron-Bridge
(NVIDIA-NeMo/Megatron-Bridge#1750), awaiting
PR NVIDIA-NeMo/Megatron-Bridge#1762
- [x] Test out MoE models

## Future Work
- Once Megatron-Bridge support for exporting only the LoRA parameters
lands, we should sync just those to vLLM for lower communication cost
- Add support for other LoRA variants from Megatron-Bridge (canonical
LoRA, QLoRA, DoRA).
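
The adapter-only sync mentioned under Future Work amounts to filtering the state dict down to adapter tensors before transfer. A sketch under the assumption that adapter parameter names contain "lora" (Megatron-Bridge's actual naming may differ):

```python
def lora_only_state_dict(state_dict: dict) -> dict:
    # Shipping only the adapter tensors (typically well under 1% of
    # total parameters) makes weight sync to the inference engine far
    # cheaper than re-sending the full merged model every step.
    return {k: v for k, v in state_dict.items() if "lora" in k.lower()}
```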
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026