
Refactor LLM pretraining examples #7159

Merged — 22 commits merged into NVIDIA:main from llm-trainer-builder on Aug 16, 2023
Conversation

@maanug-nv (Collaborator) commented on Aug 3, 2023

What does this PR do?

Simplify the LLM pretraining example scripts by moving common setup logic into a new TrainerBuilder class and into exp_manager.

Collection: NLP

Changelog

  • add a TrainerBuilder type that hides common logic for setting up a Trainer (see the sketch after this list)
  • move the logic that handles the resume_from_checkpoint arg into exp_manager. The resume_from_checkpoint logic is currently disabled, with a TODO comment, since its current behavior is not a desirable user experience; a separate PR will improve this.
  • use the above refactors to reduce the length of the Megatron LLM pretraining example scripts
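For context, the builder wraps the Trainer wiring that each pretraining script previously repeated. A minimal sketch follows; the class, method, and config-field names here are illustrative assumptions, not necessarily those in this PR:

    from pytorch_lightning import Trainer

    class TrainerBuilder:
        """Hides Trainer setup logic shared by the Megatron pretraining examples."""

        def __init__(self, cfg):
            self.cfg = cfg  # Hydra/OmegaConf config for the training run

        def _plugins(self):
            # Common plugin/strategy wiring (precision, cluster environment, ...)
            # is assembled here; model-specific builders can subclass and override.
            return []

        def create_trainer(self) -> Trainer:
            # Forward the trainer section of the config to the PTL Trainer.
            return Trainer(plugins=self._plugins(), **self.cfg.trainer)

Per the commit list, model-specific builders (e.g. for BERT and T5) subclass this base and override only what differs.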

Usage

  • An illustrative usage snippet is shown below.
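With the builder in place, a pretraining example script shrinks to roughly the following. This sketch reuses the illustrative TrainerBuilder name from above together with NeMo's hydra_runner, exp_manager, and MegatronGPTModel entry points; exact names and imports in the merged code may differ:

    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
    from nemo.core.config import hydra_runner
    from nemo.utils.exp_manager import exp_manager

    @hydra_runner(config_path="conf", config_name="megatron_gpt_config")
    def main(cfg) -> None:
        # One call replaces the strategy/plugin/Trainer boilerplate each
        # example previously carried (TrainerBuilder as sketched above).
        trainer = TrainerBuilder(cfg).create_trainer()
        exp_manager(trainer, cfg.exp_manager)
        model = MegatronGPTModel(cfg.model, trainer)
        trainer.fit(model)

    if __name__ == '__main__':
        main()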

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Refactor

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list the specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions bot added the NLP label Aug 3, 2023
@maanug-nv force-pushed the llm-trainer-builder branch 2 times, most recently from 3408fc8 to 38f31bd on August 7, 2023 22:17
@maanug-nv (Collaborator, Author) commented:

Rebased to include PTL 2.0 changes from #6433

@maanug-nv marked this pull request as ready for review August 7, 2023 22:20
@maanug-nv force-pushed the llm-trainer-builder branch 3 times, most recently from 186bab8 to 2f303ba on August 9, 2023 19:42
@arendu requested review from arendu and removed the request for arendu August 9, 2023 23:26
@maanug-nv force-pushed the llm-trainer-builder branch 4 times, most recently from 3204f7f to e13b0af on August 11, 2023 01:02
@github-actions bot added the CI label Aug 11, 2023
@github-actions bot removed the CI label Aug 12, 2023
@maanug-nv force-pushed the llm-trainer-builder branch 4 times, most recently from 7b6dfd7 to 752aa3c on August 14, 2023 23:26
maanug-nv and others added 17 commits on August 15, 2023 17:44
    if cfg.resume_from_checkpoint is not None:
        trainer.ckpt_path = cfg.resume_from_checkpoint
    # TODO: this behavior is undesirable, need ckpts in exp_dir to take priority if present over resume_from_checkpoint
    # if cfg.resume_from_checkpoint is not None:
A Collaborator commented:
@maanug-nv where are we taking care of the lines below, then:

    if cfg.model.resume_from_checkpoint is not None:
        trainer.ckpt_path = cfg.model.resume_from_checkpoint
    logging.info(f'Resuming training from checkpoint: {trainer.ckpt_path}')

The pretraining scripts used to assign the checkpoint to trainer.ckpt_path whenever a checkpoint path was passed for resume_from_checkpoint under model in the config.

@maanug-nv (Collaborator, Author) replied on Aug 16, 2023:
Yes; initially I moved those lines exactly as-is to this place in exp_manager.py. After testing and discussing with @titu1994, having those lines here (or in the pretraining scripts, as they were before and still are on main) has some undesirable behavior; details below. I wanted to keep this PR purely a refactor (I thought that would get it merged faster), so I'll correct the behavior in another PR. I can uncomment these lines if you prefer.

If resume_from_checkpoint is set, that checkpoint is always used regardless of what is in the log dir. What makes more sense is for resume_from_checkpoint to be used only when no checkpoint is present in log_dir, with log_dir taking priority when one is; a sketch of that priority follows.
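A minimal sketch of that intended priority, using a hypothetical helper name (this is not code from the PR):

    import os
    from typing import Optional

    def resolve_ckpt_path(log_dir: str, resume_from_checkpoint: Optional[str]) -> Optional[str]:
        """Prefer a checkpoint found in the experiment log_dir; fall back to the
        explicitly configured resume_from_checkpoint only if none exists there."""
        ckpt_dir = os.path.join(log_dir, "checkpoints")
        if os.path.isdir(ckpt_dir):
            ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".ckpt"))
            if ckpts:
                # log_dir takes priority: resume from the latest checkpoint there
                return os.path.join(ckpt_dir, ckpts[-1])
        # nothing in log_dir, so honor resume_from_checkpoint (which may be None)
        return resume_from_checkpoint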

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@maanug-nv will address resume_from_checkpoint in a follow-up PR

@ericharper merged commit f90eea1 into NVIDIA:main Aug 16, 2023
10 of 11 checks passed
dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
* add builder class
* formatting
* use trainer builder for gpt pretraining example
* subclass trainer builder for bert
* use trainer builder for bert pretraining example
* subclass t5 builder and use in t5 pretraining
* move resume_from_checkpoint logic to exp_manager
* add docstring for resume_from_checkpoint
* set resume_from_checkpoint with interpolation
* remove refactored lines
* unused import
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* another unused import
* bug fix
* another bug missed in rebase
* add copyright
* add type annotation
* docstrings for trainer builder
* move trainer builder file
* not needed for ptl 2.0
* disable resume_from_checkpoint logic in exp_manager

---------

Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: dorotat <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
(same squashed commit message as above)
Labels: NLP
Projects: none yet
Development: successfully merging this pull request may close these issues — none yet
3 participants