
Add distopt support for FP8 params and BF16 optimizer state #7909

Merged: 21 commits into NVIDIA:main on Jan 12, 2024

Conversation

@timmoon10 (Collaborator) commented on Nov 17, 2023

What does this PR do?

Adds support in the Apex distributed Adam optimizer for FP8 parameters (using experimental FP8 tensors from Transformer Engine) and BF16 optimizer state.

Collection: NLP

Changelog

  • Adds distributed optimizer support for FP8 parameters
  • Adds the option to initialize GPT with FP8 parameters
  • Adds support for non-FP32 distributed optimizer state

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable FP8 support with `model.fp8=True`, FP8 parameters with `model.fp8_params=True`, the distributed optimizer with `model.optim.name=distributed_fused_adam`, and BF16 optimizer state with `model.optim.dtype=bf16`.
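For example, a launch might look like the following sketch (illustrative only; it assumes the standard `megatron_gpt_pretraining.py` entry point, and all other model and trainer settings are left at whatever you normally use):

```bash
# Illustrative: enable FP8 params and BF16 distributed-optimizer state
# on top of the standard GPT pretraining config (megatron_gpt_config.yaml).
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.fp8=True \
    model.fp8_params=True \
    model.optim.name=distributed_fused_adam \
    model.optim.dtype=bf16
```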

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

@github-actions bot added the core (Changes to NeMo Core), NLP, and CI labels on Nov 17, 2023
@timmoon10 removed the core (Changes to NeMo Core) and CI labels on Nov 17, 2023
@github-actions bot added the core (Changes to NeMo Core) and CI labels on Nov 17, 2023
@chiendb97 commented:

I trained the LLaMA model on 2 nodes using `model.fp8=True`, `model.fp8_params=True`, `model.optim.name=distributed_fused_adam`, and `model.optim.dtype=bf16`. I got this error when saving a checkpoint:

[Screenshot of the error, 2023-11-20 17:27]

How can I solve this problem?

Thank you!

@timmoon10 (Collaborator, Author) commented:

@chiendb97 Thanks for the report! I reproduced the error and have a fix at NVIDIA/TransformerEngine#529. I haven't tested thoroughly, but I was able to save and load a checkpoint for LLaMA with FP8 params.

@lhb8125 (Contributor) commented on Dec 8, 2023

@timmoon10 Is there any blocker for the review? With this PR, the memory allocation per parameter (in bytes) is 1 (FP8 weight) + 1 (FP8 transpose) + 2 (BF16 gradient) + [2 (BF16 main weight) + 2 (BF16 momentum) + 2 (BF16 variance)]/dp = 4 + 6/dp. Is my understanding correct?
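(If that accounting is right, then e.g. a data-parallel size of dp = 8 gives 4 + 6/8 = 4.75 bytes per parameter.)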

@timmoon10 (Collaborator, Author) commented on Dec 12, 2023

@lhb8125 The only remaining blocker is fixing some convergence issues when running with SFT. I don't fully understand that issue yet, but I don't expect it to require major changes. Starting a review would be great.

@timmoon10 (Collaborator, Author) commented:

I've found a bug when using this PR with LLaMA SFT. Bugfix: NVIDIA/TransformerEngine#567

This does not affect GPT pretraining, though. I think this is ready to review and merge.

@ericharper (Collaborator) commented: jenkins

@ericharper previously approved these changes on Dec 15, 2023

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@timmoon10 (Collaborator, Author) commented:

The Jenkins failure is because it is using an old version of Apex. The Dockerfile and README have been updated with the required Apex version.

@ericharper (Collaborator) commented:

What base PyTorch version is needed? We can update it in the Jenkinsfile.

@erhoo82 previously approved these changes on Jan 3, 2024

@erhoo82 (Collaborator) commented on Jan 3, 2024

@athitten Can you help review this PR? I think @ericharper is away.

@timmoon10 (Collaborator, Author) commented: jenkins

@erhoo82 previously approved these changes on Jan 9, 2024

@erhoo82 (Collaborator) left a comment:

LGTM

@timmoon10 (Collaborator, Author) commented: jenkins

@erhoo82 (Collaborator) commented on Jan 9, 2024

@timmoon10 Can you check why this fails the CI?

@timmoon10 (Collaborator, Author) commented on Jan 9, 2024

The error message is "No space left on device", so I suspect it's related to the recent file system issues on DLCluster. I find I often need to rerun a couple of times to get past these errors, as well as segfaults coming from ASR.

@timmoon10 (Collaborator, Author) commented: jenkins

@timmoon10 (Collaborator, Author) commented: jenkins

@timmoon10 (Collaborator, Author) commented: jenkins

@timmoon10 (Collaborator, Author) commented: jenkins

@ericharper (Collaborator) commented: jenkins

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@ericharper (Collaborator) commented: jenkins

@ericharper merged commit 6082d76 into NVIDIA:main on Jan 12, 2024. 11 checks passed.
minitu pushed a commit to minitu/NeMo that referenced this pull request on Jan 17, 2024

minitu pushed a commit to minitu/NeMo that referenced this pull request on Jan 19, 2024

Add distopt support for FP8 params and BF16 optimizer state (#7909)

* Add distopt support for FP8 params and BF16 optimizer state
* Removed unused import
* Update PyTorch container in Jenkins pipeline
* Use custom container with Apex bugfixes (see NVIDIA/apex#1760)
* Upgrade to PyTorch 23.11 container
* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request on Feb 15, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request on Jun 25, 2024
Labels: CI, core (Changes to NeMo Core), NLP

5 participants