[megatron] feat: LoRA adapter only refit (TensorLoRARequest)#4632
[megatron] feat: LoRA adapter only refit (TensorLoRARequest)#4632ISEEKYAN merged 1 commit intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a significant feature for LoRA adapter-only weight updates in the Megatron backend, which avoids the overhead of merging adapters into the base model for each weight synchronization. The changes are well-structured and include updates to documentation, configuration handling for LoRA, new PEFT utility functions for vLLM compatibility, and refactored weight export logic. The implementation appears solid and aligns with the stated goals. I have a couple of suggestions to enhance code maintainability and clarity regarding duplicated logic and a confusing condition.
f6706bc to
81d261e
Compare
23b8716 to
797dc7f
Compare
dddcffe to
91d4732
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable feature for LoRA adapter-only refitting in the Megatron backend, which should provide significant performance benefits. The changes across documentation, configuration, and worker implementations appear to correctly support both merging LoRA adapters and loading them separately. My primary concern, detailed in a specific comment, is the repeated implementation of LoRA configuration logic across several files. Addressing this will improve the long-term maintainability of the codebase.
1ad0f96 to
0c7b07a
Compare
49a4289 to
cb4bfe1
Compare
ef83292 to
8254e91
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a significant feature for LoRA adapter-only refit with the Megatron backend, which is a great addition for performance and flexibility. The changes are extensive, touching documentation, examples, configuration, and core worker logic. The implementation of separate synchronization for base and adapter weights is well-thought-out.
I've identified a few critical logic issues in the configuration handling for LoRA, particularly concerning the new merge option and backward compatibility with older LoRA settings. These issues are present in three different files, indicating that the logic is duplicated. For better maintainability, I recommend refactoring this configuration logic into a shared utility function to ensure consistency and make future updates easier. Addressing these points will significantly improve the robustness of this new feature.
a7aecaf to
ce759a7
Compare
ce759a7 to
7ba73c0
Compare
|
|
||
| import torch | ||
|
|
||
| # Map megatron lora target modules to HF-style module names for vLLM |
There was a problem hiding this comment.
This is only an advice, not necessary, can we move the lora mapping to megatron-bridge instead of keeping them in verl?
There was a problem hiding this comment.
Yes, I think this is something in discussion, and I guess it would need some API designs.
There was a problem hiding this comment.
add @yaoyu-33 for vis. We need to extend bridge APIs with capability of exporting lora weights.
|
cc: @wuxibin89 @vermouth1992 for further comments |
2629b6a to
7ad4c1d
Compare
43487b6 to
c49313f
Compare
Signed-off-by: Hollow Man <hollowman@opensuse.org>
c49313f to
30a0782
Compare
…oject#4632) ### What does this PR do? <img width="2206" height="1314" alt="lora-performance" src="https://github.com/user-attachments/assets/0482f423-01a3-4e52-a7ee-8b9cd79b7b1a" /> <img width="2208" height="1800" alt="lora-critic-val-score" src="https://github.com/user-attachments/assets/6ce10400-8164-47d8-90a6-c1bf002fb9e8" /> <img width="2204" height="1794" alt="lora-actor-plus-rollout-mismatch" src="https://github.com/user-attachments/assets/092d3a43-4eba-425e-a584-8d83c1f02de4" /> ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…oject#4632) ### What does this PR do? <img width="2206" height="1314" alt="lora-performance" src="https://github.com/user-attachments/assets/0482f423-01a3-4e52-a7ee-8b9cd79b7b1a" /> <img width="2208" height="1800" alt="lora-critic-val-score" src="https://github.com/user-attachments/assets/6ce10400-8164-47d8-90a6-c1bf002fb9e8" /> <img width="2204" height="1794" alt="lora-actor-plus-rollout-mismatch" src="https://github.com/user-attachments/assets/092d3a43-4eba-425e-a584-8d83c1f02de4" /> ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…oject#4632) ### What does this PR do? <img width="2206" height="1314" alt="lora-performance" src="https://github.com/user-attachments/assets/0482f423-01a3-4e52-a7ee-8b9cd79b7b1a" /> <img width="2208" height="1800" alt="lora-critic-val-score" src="https://github.com/user-attachments/assets/6ce10400-8164-47d8-90a6-c1bf002fb9e8" /> <img width="2204" height="1794" alt="lora-actor-plus-rollout-mismatch" src="https://github.com/user-attachments/assets/092d3a43-4eba-425e-a584-8d83c1f02de4" /> ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)