
Refactor LLM pretraining examples #7159

Merged — 22 commits merged into NVIDIA:main from llm-trainer-builder on Aug 16, 2023
Conversation

@maanug-nv (Collaborator) commented on Aug 3, 2023

What does this PR do?

Simplify the LLM pretraining example scripts by moving common setup logic into a new TrainerBuilder class and into exp_manager.

Collection: NLP

Changelog

  • add a TrainerBuilder type that hides common logic for setting up a Trainer (see the sketch after this list)
  • move the logic that handles the resume_from_checkpoint arg into exp_manager. The resume_from_checkpoint logic is currently disabled, with a TODO comment, since its current behavior is not a desirable user experience; a separate PR will improve this.
  • use the above refactors to reduce the length of the Megatron LLM pretraining example scripts
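For context, the builder wraps the Trainer wiring that each pretraining script previously repeated. A minimal sketch follows; the class, method, and config-field names here are illustrative assumptions, not necessarily those in this PR:

    from pytorch_lightning import Trainer

    class TrainerBuilder:
        """Hides Trainer setup logic shared by the Megatron pretraining examples."""

        def __init__(self, cfg):
            self.cfg = cfg  # Hydra/OmegaConf config for the training run

        def _plugins(self):
            # Common plugin/strategy wiring (precision, cluster environment, ...)
            # is assembled here; model-specific builders can subclass and override.
            return []

        def create_trainer(self) -> Trainer:
            # Forward the trainer section of the config to the PTL Trainer.
            return Trainer(plugins=self._plugins(), **self.cfg.trainer)

Per the commit list, model-specific builders (e.g. for BERT and T5) subclass this base and override only what differs.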

Usage

  • An illustrative usage snippet is shown below.
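With the builder in place, a pretraining example script shrinks to roughly the following. This sketch reuses the illustrative TrainerBuilder name from above together with NeMo's hydra_runner, exp_manager, and MegatronGPTModel entry points; exact names and imports in the merged code may differ:

    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
    from nemo.core.config import hydra_runner
    from nemo.utils.exp_manager import exp_manager

    @hydra_runner(config_path="conf", config_name="megatron_gpt_config")
    def main(cfg) -> None:
        # One call replaces the strategy/plugin/Trainer boilerplate each
        # example previously carried (TrainerBuilder as sketched above).
        trainer = TrainerBuilder(cfg).create_trainer()
        exp_manager(trainer, cfg.exp_manager)
        model = MegatronGPTModel(cfg.model, trainer)
        trainer.fit(model)

    if __name__ == '__main__':
        main()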

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Refactor

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list the specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions bot added the NLP label Aug 3, 2023
@maanug-nv force-pushed the llm-trainer-builder branch 2 times, most recently from 3408fc8 to 38f31bd on August 7, 2023 22:17
@maanug-nv (Collaborator, Author) commented:

Rebased to include PTL 2.0 changes from #6433

@maanug-nv marked this pull request as ready for review August 7, 2023 22:20
@maanug-nv force-pushed the llm-trainer-builder branch 3 times, most recently from 186bab8 to 2f303ba on August 9, 2023 19:42
@arendu requested review from arendu and removed the request for arendu August 9, 2023 23:26
@maanug-nv force-pushed the llm-trainer-builder branch 4 times, most recently from 3204f7f to e13b0af on August 11, 2023 01:02
@github-actions bot added the CI label Aug 11, 2023
@github-actions bot removed the CI label Aug 12, 2023
@maanug-nv force-pushed the llm-trainer-builder branch 4 times, most recently from 7b6dfd7 to 752aa3c on August 14, 2023 23:26
maanug-nv and others added 17 commits on August 15, 2023 17:44
    if cfg.resume_from_checkpoint is not None:
        trainer.ckpt_path = cfg.resume_from_checkpoint
    # TODO: this behavior is undesirable, need ckpts in exp_dir to take priority if present over resume_from_checkpoint
    # if cfg.resume_from_checkpoint is not None:
A Collaborator commented:
@maanug-nv where are we taking care of the lines below, then:

    if cfg.model.resume_from_checkpoint is not None:
        trainer.ckpt_path = cfg.model.resume_from_checkpoint
    logging.info(f'Resuming training from checkpoint: {trainer.ckpt_path}')

The pretraining scripts used to assign the checkpoint to trainer.ckpt_path whenever a checkpoint path was passed for resume_from_checkpoint under model in the config.

@maanug-nv (Collaborator, Author) replied on Aug 16, 2023:
Yes; initially I moved those lines exactly as-is to this place in exp_manager.py. After testing and discussing with @titu1994, having those lines here (or in the pretraining scripts, as they were before and still are on main) has some undesirable behavior; details below. I wanted to keep this PR purely a refactor (I thought that would get it merged faster), so I'll correct the behavior in another PR. I can uncomment these lines if you prefer.

If resume_from_checkpoint is set, that checkpoint is always used regardless of what is in the log dir. What makes more sense is for resume_from_checkpoint to be used only when no checkpoint is present in log_dir, with log_dir taking priority when one is; a sketch of that priority follows.
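A minimal sketch of that intended priority, using a hypothetical helper name (this is not code from the PR):

    import os
    from typing import Optional

    def resolve_ckpt_path(log_dir: str, resume_from_checkpoint: Optional[str]) -> Optional[str]:
        """Prefer a checkpoint found in the experiment log_dir; fall back to the
        explicitly configured resume_from_checkpoint only if none exists there."""
        ckpt_dir = os.path.join(log_dir, "checkpoints")
        if os.path.isdir(ckpt_dir):
            ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".ckpt"))
            if ckpts:
                # log_dir takes priority: resume from the latest checkpoint there
                return os.path.join(ckpt_dir, ckpts[-1])
        # nothing in log_dir, so honor resume_from_checkpoint (which may be None)
        return resume_from_checkpoint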

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@maanug-nv will address resume_from_checkpoint in a follow-up PR

@ericharper merged commit f90eea1 into NVIDIA:main Aug 16, 2023
10 of 11 checks passed
dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
* add builder class
* formatting
* use trainer builder for gpt pretraining example
* subclass trainer builder for bert
* use trainer builder for bert pretraining example
* subclass t5 builder and use in t5 pretraining
* move resume_from_checkpoint logic to exp_manager
* add docstring for resume_from_checkpoint
* set resume_from_checkpoint with interpolation
* remove refactored lines
* unused import
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* another unused import
* bug fix
* another bug missed in rebase
* add copyright
* add type annotation
* docstrings for trainer builder
* move trainer builder file
* not needed for ptl 2.0
* disable resume_from_checkpoint logic in exp_manager

---------

Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: dorotat <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
(same squashed commit message as above)
Labels: NLP
Projects: none yet
Development: successfully merging this pull request may close these issues — none yet
3 participants