[mcore] add offload param and opt function for magetron by BearBiscuit05 · Pull Request #1162 · verl-project/verl

BearBiscuit05 · 2025-04-19T07:30:18Z

Motivation

This PR adds offload support for Megatron. Parameters, gradients, and optimizer states can now be offloaded to the CPU when they are not needed. I have verified that the feature works using the memory snapshot tool. Accuracy testing is still in progress.
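For readers unfamiliar with the idea, here is a minimal sketch of what offloading parameters, gradients, and optimizer state can look like in plain PyTorch. It is not the code from this PR: the helper names `move_model`, `move_optimizer_state`, and `release_cached_memory` are hypothetical, and Megatron's own model and optimizer wrappers manage their buffers differently.

```python
import gc

import torch


def move_model(module: torch.nn.Module, device: str) -> None:
    """Move parameters and any existing gradients to `device` in place."""
    for param in module.parameters():
        # Reassigning .data keeps the Parameter object (and optimizer references) intact.
        param.data = param.data.to(device)
        if param.grad is not None:
            param.grad = param.grad.to(device)


def move_optimizer_state(optimizer: torch.optim.Optimizer, device: str) -> None:
    """Move per-parameter optimizer state tensors (e.g. Adam moments) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            # Assumption: the scalar `step` counter is left where the optimizer put it.
            if torch.is_tensor(value) and key != "step":
                state[key] = value.to(device)


def release_cached_memory() -> None:
    """Drop dangling references and return cached GPU blocks to the driver."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

A typical pattern would be to call `move_model(model, "cpu")` and `move_optimizer_state(optimizer, "cpu")` followed by `release_cached_memory()` before a phase that does not need the training state, then move everything back to the GPU before the next update step.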

TODO

Accuracy testing

ccclyu · 2025-04-24T07:19:35Z

Thanks for the great work. I will finish testing on several larger models, such as Qwen72b and Mixtral, by the end of this week. cc: @ann-qin-lu

BearBiscuit05 · 2025-04-24T07:24:41Z

> Thanks for the great work. I will finish testing on several larger models, such as Qwen72b and Mixtral, by the end of this week. cc: @ann-qin-lu

Thanks. Please note that the logger currently reports GPU memory incorrectly, mainly because torch cannot detect memory released by vLLM. If you want to fix the issue, see PR #1118.
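One way to cross-check such numbers, assuming the discrepancy comes from memory managed outside torch's caching allocator, is to log the allocator's counters next to the device-level figure reported by the driver. The sketch below is not taken from this PR or from PR #1118. The function name `gpu_memory_report` is hypothetical, while the `torch.cuda` calls are standard.

```python
import torch

GIB = 1024 ** 3


def gpu_memory_report() -> dict:
    """Return torch-allocator and device-level memory usage in GiB."""
    if not torch.cuda.is_available():
        # No GPU visible: keep the same keys so callers can log uniformly.
        return {"allocated_gib": None, "reserved_gib": None, "device_used_gib": None}
    # Device-wide view from the driver, covering all allocations on the GPU.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return {
        # Bytes held by live tensors in torch's caching allocator.
        "allocated_gib": torch.cuda.memory_allocated() / GIB,
        # Bytes the caching allocator has reserved from the driver.
        "reserved_gib": torch.cuda.memory_reserved() / GIB,
        "device_used_gib": (total_bytes - free_bytes) / GIB,
    }
```

If `device_used_gib` drops while `allocated_gib` and `reserved_gib` stay flat, the freed memory was likely not tracked by torch's allocator in the first place.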

ETOgaosion

There is urgent demand for this feature in larger-model training, and the test results show that it works, so let's merge it first. If only a small amount of memory turns out not to be offloadable, follow-up patches can address that along the way.

ccclyu

LGTM. We can merge this first for feature development.

ETOgaosion · 2025-04-26T18:03:14Z

> LGTM. We can merge this first for feature development.

Thanks for approving. Any suggestions for future refactoring are appreciated.

ETOgaosion · 2025-04-26T19:14:46Z

@BearBiscuit05 Remember to add some documentation if changing configurations please~

…#1162)

## Motivation

This is a PR that supports offload in Megatron. Currently, parameters, gradients, and optimizers can be offloaded to the CPU when not needed. I have successfully tested the feasibility of the function using the memory snap tool. Further accuracy testing is still in progress.

## TODO

- [x] Accuracy testing

BearBiscuit05 added 16 commits April 18, 2025 17:29

[mcore] add function for offload model and optimizer

4163953

update

0c90c7d

fix

bd5eac9

update

8bd1ac8

Merge remote-tracking branch 'upstream/main' into xya/mcore/off

ff59329

success run in dpsk 1.5B

ff4b636

fix

b0e96de

Merge branch 'main' into xya/mcore/off

917eaeb

delete unused func for offload

72a80cb

update log

2d22913

support more model

07e1e02

update

6252e06

offload copy params

71113b6

update

49c2480

copy params has no grad

6e63eb9

update

259027a

BearBiscuit05 changed the title ~~[mcore] add offload param and opt function for magetron~~ [WIP] add offload param and opt function for magetron Apr 23, 2025

BearBiscuit05 added 4 commits April 23, 2025 13:28

update

14858ad

add gc collect

8e5adfc

delete useless info

c721d7f

lint

c5dce08

BearBiscuit05 changed the title ~~[WIP] add offload param and opt function for magetron~~ [mcore] add offload param and opt function for magetron Apr 23, 2025

add turning doc

cc5c898

BearBiscuit05 requested a review from ccclyu April 24, 2025 01:44

BearBiscuit05 added 2 commits April 24, 2025 09:47

Merge remote-tracking branch 'upstream/main' into xya/mcore/off

4c0c968

update

2e40847

BearBiscuit05 mentioned this pull request Apr 24, 2025

MCore zhaochenyang20/Awesome-ML-SYS-Tutorial#119

Open

BearBiscuit05 added 3 commits April 24, 2025 11:04

update log

7dbd470

delete log

c9b3832

update

9ca088a

ETOgaosion approved these changes Apr 25, 2025


ccclyu approved these changes Apr 26, 2025


ETOgaosion merged commit cc8fca5 into verl-project:main Apr 26, 2025
19 checks passed

BearBiscuit05 deleted the xya/mcore/off branch April 27, 2025 01:39

ETOgaosion merged 26 commits into verl-project:main from BearBiscuit05:xya/mcore/off
