
Fixes for lightning 2.0 upgrade #7176

Merged Aug 12, 2023: 10 commits merged into main from athitten/ptl_2.0_fixes

Conversation

@athitten (Collaborator) commented on Aug 7, 2023

What does this PR do?

Fix the following issues with the previous lightning 2.0 upgrade PR:

  1. Decrease val_check_interval to 1 when resuming from a checkpoint in some CI tests where max_steps is the same as in the original training run, since lightning 2.0 otherwise errors out: fit_loop.py#L259. Lightning 2.0 calls setup_data right at the beginning of fit_loop.run, unlike 1.9, which did it only when self.skip for fit was False, so the error didn't show up with 1.9 (see the resume sketch under Usage below).
  2. Remove the usage of trainer._checkpoint_connector, as it is redundant in lightning 2.0, and use trainer.ckpt_path directly. Having trainer._checkpoint_connector = _CheckpointConnector(trainer) also overrode ckpt_path with None, which caused training to start from scratch when resuming from a checkpoint (also illustrated in the Usage sketch).
  3. Remove the optimizer_idx arg from optimizer_step of the MegatronHalfPrecisionPlugin class, as the parent function in lightning no longer has it (a signature sketch follows this list).
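
A minimal sketch of the signature change behind fix 3, assuming pytorch_lightning 2.0.x, where the mixed-precision plugin's optimizer_step hook takes (optimizer, model, closure, **kwargs). The stand-in class below is hypothetical and only delegates; the real MegatronHalfPrecisionPlugin in NeMo adds Megatron-specific gradient handling around this call.

```python
from pytorch_lightning.plugins.precision import MixedPrecisionPlugin


class HalfPrecisionPluginSketch(MixedPrecisionPlugin):
    """Hypothetical, simplified stand-in for MegatronHalfPrecisionPlugin."""

    # lightning 1.9 passed an extra `optimizer_idx` argument to this hook;
    # lightning 2.0 dropped it from the parent signature, so an override that
    # still declared it no longer matched the base class and had to be updated.
    def optimizer_step(self, optimizer, model, closure, **kwargs):
        # the real plugin wraps this call with its own grad scaling / clipping
        return super().optimizer_step(optimizer, model, closure, **kwargs)
```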

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
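
Not part of the PR itself, but a minimal resume sketch illustrating fixes 1 and 2, assuming plain pytorch_lightning 2.0 with its default checkpoint callback; ToyModel and loader are made up for illustration, and NeMo's actual entry points go through its example training scripts and exp_manager.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


# Hypothetical toy LightningModule; NeMo's real Megatron models are far larger.
class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loader():
    return DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)), batch_size=4)


# First run: trains for 4 steps and writes a checkpoint via the default callback.
trainer = pl.Trainer(max_steps=4, val_check_interval=1, limit_val_batches=1)
trainer.fit(ToyModel(), train_dataloaders=loader(), val_dataloaders=loader())
ckpt = trainer.checkpoint_callback.best_model_path  # path written by the first run

# Resume run with the same max_steps. Lightning 2.0 still calls setup_data at the
# start of fit_loop.run, so val_check_interval must not exceed the number of
# training batches even though no further training happens (fix 1). The checkpoint
# path is passed straight to fit(); reassigning trainer._checkpoint_connector is
# redundant in 2.0 and clobbered ckpt_path with None, silently restarting training
# from scratch (fix 2).
resume_trainer = pl.Trainer(max_steps=4, val_check_interval=1, limit_val_batches=1)
resume_trainer.fit(ToyModel(), train_dataloaders=loader(), val_dataloaders=loader(),
                   ckpt_path=ckpt)
print(resume_trainer.ckpt_path)  # resolved path, readable directly off the trainer
```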

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Jenkinsfile (outdated diff)
@@ -3603,6 +3603,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.precision=16 \
trainer.gradient_clip_val=1.0 \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
exp_manager.resume_if_exits=False \
@titu1994 (Collaborator) commented:

resume_if_exists*

@athitten (Collaborator, Author) replied:

Thanks @titu1994, corrected the typo. The reason for setting this to False is that this test (Megatron GPT Pretraining and Resume Training PP=2) was picking up the last checkpoint from the previous test (Megatron GPT with KERPLE Pretraining and Resume Training TP=2) and erroring out, because it tried to find that checkpoint in the current test's exp_dir and couldn't.

@athitten (Collaborator, Author) commented:

This was happening because I had commented these lines out by mistake in the lightning 2.0 upgrade PR: https://github.com/NVIDIA/NeMo/blob/main/Jenkinsfile#L3583-L3584. Fixing it ASAP in a separate PR.

@athitten athitten force-pushed the athitten/ptl_2.0_fixes branch 2 times, most recently from 2cdeb16 to d0140e2 Compare August 10, 2023 00:14
@github-actions github-actions bot added the core Changes to NeMo Core label Aug 10, 2023
@athitten athitten force-pushed the athitten/ptl_2.0_fixes branch 2 times, most recently from 09c2033 to 3a747c8 Compare August 10, 2023 05:17
@github-actions github-actions bot removed core Changes to NeMo Core CI labels Aug 10, 2023
@github-actions github-actions bot added the CI label Aug 11, 2023
1) Add resume_if_exists=False for Megatron GPT Pretraining and Resume Training PP=2 as it can resume from the checkpoint of the previous model test in CI, leading to an error

Signed-off-by: Abhishree <[email protected]>
1) Remove arg optimizer_idx in optimizer_step func as the arg is not used by the parent func of lightning

Signed-off-by: Abhishree <[email protected]>
@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@ericharper ericharper marked this pull request as ready for review August 12, 2023 01:39
@ericharper ericharper merged commit ab749e4 into main Aug 12, 2023
15 checks passed
@ericharper ericharper deleted the athitten/ptl_2.0_fixes branch August 12, 2023 04:51
guyueh1 pushed a commit to guyueh1/NeMo that referenced this pull request Aug 14, 2023
* Remove trainer._checkpoint_connector = _CheckpointConnector(trainer)

Signed-off-by: Abhishree <[email protected]>

* Remove trainer._checkpoint_connector = _CheckpointConnector(trainer) for NMT

Signed-off-by: Abhishree <[email protected]>

* Add resume_if_exists=False in JenkinsFile

1) Add resume_if_exists=False for Megatron GPT Pretraining and Resume Training PP=2 as it can resume from the checkpoint of the previous model test in CI, leading to an error

Signed-off-by: Abhishree <[email protected]>

* Remove optimizer_idx in optimizer_step and fix typo

1) Remove arg optimizer_idx in optimizer_step func as the arg is not used by the parent func of lightning

Signed-off-by: Abhishree <[email protected]>

* Remove resume_if_exists=False in JenkinsFile

Signed-off-by: Abhishree <[email protected]>

* Make trainer.val_check_interval=1 for few tests in JenkinsFile

Signed-off-by: Abhishree <[email protected]>

* Change val_check_interval in JenkinsFile during resume to less than len(dataloader)

Signed-off-by: Abhishree <[email protected]>

* Change val_check_interval to 1 for Megatron T5 with ALiBi resume

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024