[megatron] feat: checkpoint save as HF PEFT format #5575
ETOgaosion merged 2 commits into verl-project:main
Code Review
This pull request refactors the PEFT checkpoint saving and loading mechanism to leverage the Megatron-Bridge, which is a positive change that centralizes logic and removes custom implementations. The overall approach is sound. I've identified a couple of areas for improvement related to code duplication and the use of private APIs, which could enhance the long-term maintainability of this critical checkpointing functionality.
Pull request overview
This PR updates VERL’s Megatron-Bridge PEFT checkpointing flow to rely on Megatron’s distributed checkpointing and adds support for saving PEFT adapters in HuggingFace (PEFT) format via Megatron-Bridge.
Changes:
- Switch PEFT adapter load in `make_megatron_module()` from the repo’s custom adapter checkpoint format to Megatron distributed checkpoint loading with a PEFT filter.
- Update `MegatronCheckpointManager` to filter model state dicts to adapter-only when PEFT is enabled, and to save HF PEFT adapters via a new `bridge.save_hf_adapter()` API.
- Remove the legacy `*_adapter_checkpoint` save/load utilities and exports from `megatron_peft_utils.py`.
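The adapter-only filtering described in the second item can be pictured with a small sketch. It is plain Python over a dict and is not the actual `MegatronCheckpointManager` code; the helper name `filter_adapter_state` and the `.adapter.` key marker are assumptions made for illustration, since this conversation does not show how adapter parameters are identified.

```python
from typing import Any, Dict, Iterable

# Hypothetical marker: the real naming of adapter parameters in
# Megatron-Bridge is not shown in this PR conversation.
ADAPTER_MARKERS = (".adapter.",)


def filter_adapter_state(
    state_dict: Dict[str, Any],
    markers: Iterable[str] = ADAPTER_MARKERS,
) -> Dict[str, Any]:
    """Return only the entries whose key contains one of the markers."""
    markers = tuple(markers)
    return {
        key: value
        for key, value in state_dict.items()
        if any(marker in key for marker in markers)
    }


# Toy state dict: one frozen base weight and two adapter weights.
full_state = {
    "decoder.layers.0.linear_qkv.weight": "base-weight",
    "decoder.layers.0.linear_qkv.adapter.linear_in.weight": "lora-A",
    "decoder.layers.0.linear_qkv.adapter.linear_out.weight": "lora-B",
}
adapter_state = filter_adapter_state(full_state)
```

Only the two adapter entries survive the filter, which is why an adapter-only checkpoint is much smaller than a full model checkpoint.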
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `verl/utils/megatron_utils.py` | Loads PEFT adapter weights through Megatron distributed checkpointing during pre-wrap PEFT transformation. |
| `verl/utils/megatron_peft_utils.py` | Removes legacy adapter-only checkpoint save/load helpers and related exports. |
| `verl/utils/checkpoint/megatron_checkpoint_manager.py` | Filters dist-checkpoint model state to adapter-only for PEFT, adjusts strictness on load, and adds HF PEFT adapter saving via Megatron-Bridge. |
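Why the strictness on load has to change once the checkpoint is adapter-only can be sketched as follows. This is a toy model of key matching, not Megatron's distributed checkpoint loader; `load_state` and its `strict` flag are hypothetical stand-ins.

```python
from typing import Any, Dict, List


def load_state(
    model_state: Dict[str, Any],
    checkpoint: Dict[str, Any],
    strict: bool = True,
) -> List[str]:
    """Copy checkpoint entries into model_state and return the missing keys.

    With strict=True, any model key absent from the checkpoint is an error.
    """
    missing = [key for key in model_state if key not in checkpoint]
    if strict and missing:
        raise KeyError(f"missing keys in checkpoint: {missing}")
    for key, value in checkpoint.items():
        if key in model_state:
            model_state[key] = value
    return missing


model_state = {"base.weight": 0.0, "base.adapter.weight": 0.0}
adapter_only_ckpt = {"base.adapter.weight": 1.5}

# A strict load fails because the frozen base weight is not in the checkpoint.
try:
    load_state(dict(model_state), adapter_only_ckpt, strict=True)
    strict_failed = False
except KeyError:
    strict_failed = True

# A relaxed load restores the adapter and leaves the base weight untouched.
missing = load_state(model_state, adapter_only_ckpt, strict=False)
```

In this sketch the base weights are expected to come from elsewhere (for example the pretrained model), so their absence from the adapter checkpoint is reported rather than treated as an error.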
New API endpoints added on Megatron-Bridge side
Signed-off-by: Hollow Man <hollowman@opensuse.org>
ETOgaosion left a comment
Great work! Can I understand this as a runnable refactor of PEFT format saving? Did the previous saving methods have some bugs?
Do we need to modify some docs in another PR?
Thank you!
Yes, mainly to move away from Verl's own customized PEFT checkpointing format (which I introduced in #4063, hh) to the official APIs introduced by NVIDIA-NeMo/Megatron-Bridge#2574.
I didn't find any bugs previously, unless other people have found something. The main goal was to centralize the code in Megatron-Bridge and keep the related Verl codebase clean.
The PEFT checkpointing format was not documented in https://github.com/verl-project/verl/blob/main/docs/advance/ppo_lora.rst, so feel free to add a section about it.
What does this PR do?
New API endpoints were added on the Megatron-Bridge side; this PR needs NVIDIA-NeMo/Megatron-Bridge#2574 (which has been merged).
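For readers unfamiliar with the target format: an adapter saved by the HuggingFace PEFT library is a directory holding an `adapter_config.json` plus an adapter weight file (`adapter_model.safetensors`, or `adapter_model.bin` for the older pickle format). The sketch below checks a directory for that layout using only the standard library. Whether `bridge.save_hf_adapter()` writes exactly these files is an assumption here, and `looks_like_hf_peft_adapter` is a hypothetical helper, not part of verl or Megatron-Bridge.

```python
import json
import tempfile
from pathlib import Path

# File names used by the HuggingFace PEFT library for a saved adapter.
CONFIG_NAME = "adapter_config.json"
WEIGHT_NAMES = ("adapter_model.safetensors", "adapter_model.bin")


def looks_like_hf_peft_adapter(directory: Path) -> bool:
    """Check for a PEFT config with a peft_type plus one adapter weight file."""
    config_path = directory / CONFIG_NAME
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text())
    has_weights = any((directory / name).is_file() for name in WEIGHT_NAMES)
    return "peft_type" in config and has_weights


# Build a throwaway directory that mimics the layout (the weight file is empty).
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp)
    (adapter_dir / CONFIG_NAME).write_text(json.dumps({"peft_type": "LORA", "r": 8}))
    (adapter_dir / "adapter_model.safetensors").write_bytes(b"")
    is_adapter = looks_like_hf_peft_adapter(adapter_dir)
    is_missing_adapter = looks_like_hf_peft_adapter(adapter_dir / "missing")
```

A check like this only validates the directory layout; it does not verify that the tensors inside the weight file match the config.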
Checklist Before Starting
- `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.