
Add distopt support for FP8 params and BF16 optimizer state #7909

Merged: 21 commits into NVIDIA:main on Jan 12, 2024

Conversation

@timmoon10 (Collaborator) commented on Nov 17, 2023

What does this PR do?

Adds support in the Apex distributed Adam optimizer for FP8 parameters (using experimental FP8 tensors from Transformer Engine) and BF16 optimizer state.

Collection: NLP

Changelog

  • Adds distributed optimizer support for FP8 parameters
  • Adds the option to initialize GPT with FP8 parameters
  • Adds support for non-FP32 distributed optimizer state

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable FP8 support with `model.fp8=True`, FP8 parameters with `model.fp8_params=True`, the distributed optimizer with `model.optim.name=distributed_fused_adam`, and BF16 optimizer state with `model.optim.dtype=bf16`.
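For example, a launch might look like the following sketch (illustrative only; it assumes the standard `megatron_gpt_pretraining.py` entry point, and all other model and trainer settings are left at whatever you normally use):

```bash
# Illustrative: enable FP8 params and BF16 distributed-optimizer state
# on top of the standard GPT pretraining config (megatron_gpt_config.yaml).
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.fp8=True \
    model.fp8_params=True \
    model.optim.name=distributed_fused_adam \
    model.optim.dtype=bf16
```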

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

@github-actions bot added the core (Changes to NeMo Core), NLP, and CI labels on Nov 17, 2023
@timmoon10 removed the core (Changes to NeMo Core) and CI labels on Nov 17, 2023
@github-actions bot added the core (Changes to NeMo Core) and CI labels on Nov 17, 2023
@chiendb97 commented:

I trained the LLaMA model on 2 nodes using `model.fp8=True`, `model.fp8_params=True`, `model.optim.name=distributed_fused_adam`, and `model.optim.dtype=bf16`. I got this error when saving a checkpoint:

[Screenshot of the error, 2023-11-20 17:27]

How can I solve this problem?

Thank you!

@timmoon10 (Collaborator, Author) commented:

@chiendb97 Thanks for the report! I reproduced the error and have a fix at NVIDIA/TransformerEngine#529. I haven't tested thoroughly, but I was able to save and load a checkpoint for LLaMA with FP8 params.

@lhb8125 (Contributor) commented on Dec 8, 2023

@timmoon10 Is there any blocker for the review? With this PR, the memory allocation per parameter (in bytes) is 1 (FP8 weight) + 1 (FP8 transpose) + 2 (BF16 gradient) + [2 (BF16 main weight) + 2 (BF16 momentum) + 2 (BF16 variance)]/dp = 4 + 6/dp. Is my understanding correct?
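(If that accounting is right, then e.g. a data-parallel size of dp = 8 gives 4 + 6/8 = 4.75 bytes per parameter.)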

@timmoon10 (Collaborator, Author) commented on Dec 12, 2023

@lhb8125 The only remaining blocker is fixing some convergence issues when running with SFT. I don't fully understand that issue yet, but I don't expect it to require major changes. Starting a review would be great.

@timmoon10 (Collaborator, Author) commented:

I've found a bug when using this PR with LLaMA SFT. Bugfix: NVIDIA/TransformerEngine#567

This does not affect GPT pretraining, though. I think this is ready to review and merge.

@ericharper (Collaborator) commented: jenkins

@ericharper previously approved these changes on Dec 15, 2023

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@timmoon10 (Collaborator, Author) commented:

The Jenkins failure is because it is using an old version of Apex. The Dockerfile and README have been updated with the required Apex version.

@ericharper (Collaborator) commented:

What base PyTorch version is needed? We can update it in the Jenkinsfile.

@erhoo82 previously approved these changes on Jan 3, 2024

@erhoo82 (Collaborator) commented on Jan 3, 2024

@athitten Can you help review this PR? I think @ericharper is away.

@timmoon10 (Collaborator, Author) commented: jenkins

@erhoo82 previously approved these changes on Jan 9, 2024

@erhoo82 (Collaborator) left a comment:

LGTM

@timmoon10 (Collaborator, Author) commented: jenkins

@erhoo82 (Collaborator) commented on Jan 9, 2024

@timmoon10 Can you check why this fails the CI?

@timmoon10 (Collaborator, Author) commented on Jan 9, 2024

The error message is "No space left on device", so I suspect it's related to the recent file system issues on DLCluster. I find I often need to rerun a couple of times to get past these errors, as well as segfaults coming from ASR.

@timmoon10 (Collaborator, Author) commented: jenkins

@timmoon10 (Collaborator, Author) commented: jenkins

@timmoon10 (Collaborator, Author) commented: jenkins

@timmoon10 (Collaborator, Author) commented: jenkins

@ericharper (Collaborator) commented: jenkins

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@ericharper (Collaborator) commented: jenkins

@ericharper merged commit 6082d76 into NVIDIA:main on Jan 12, 2024. 11 checks passed.
minitu pushed a commit to minitu/NeMo that referenced this pull request on Jan 17, 2024

minitu pushed a commit to minitu/NeMo that referenced this pull request on Jan 19, 2024

Add distopt support for FP8 params and BF16 optimizer state (#7909)

* Add distopt support for FP8 params and BF16 optimizer state
* Removed unused import
* Update PyTorch container in Jenkins pipeline
* Use custom container with Apex bugfixes (see NVIDIA/apex#1760)
* Upgrade to PyTorch 23.11 container
* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request on Feb 15, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request on Jun 25, 2024
Labels: CI, core (Changes to NeMo Core), NLP

5 participants