
Bugfix for BF16 grad reductions with distopt #6340

Merged
3 commits merged into NVIDIA:main on Mar 31, 2023

Conversation

timmoon10 (Collaborator) commented:

What does this PR do?

#5920 added support for BF16 grad reductions with distopt, with embedding grad reductions done in FP32. @mikolajblaz found some bugs: it turns out the FP32 reductions were not being done at all. This PR fixes those issues. Running GPT-3 175B, I confirm the embedding grads are now optimized with the FP32 optimizer, and loss values with FP32 and BF16 grad reductions match within numerical accuracy for up to 50 steps.

This turned out messier than I would have liked. A cleaner approach would be to integrate distopt support for multiple grad dtypes directly into Apex.
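
For context, a minimal conceptual sketch of the intended behavior follows. This is not the actual Apex DistributedFusedAdam code path; the helper and the set of FP32 parameter names are hypothetical. The idea is simply that most grads are all-reduced in BF16 while selected grads (the embeddings) are reduced in FP32.

```python
# Conceptual sketch only -- not the Apex DistributedFusedAdam implementation.
# Assumes torch.distributed is already initialized.
import torch
import torch.distributed as dist


def reduce_gradients(named_params, fp32_grad_names):
    """Hypothetical helper: reduce most grads in BF16, selected grads (e.g. embeddings) in FP32."""
    world_size = dist.get_world_size()
    for name, param in named_params:
        if param.grad is None:
            continue
        if name in fp32_grad_names:
            # Embedding grads are reduced directly in FP32.
            dist.all_reduce(param.grad)
            param.grad.div_(world_size)
        else:
            # Other grads are cast to BF16 before the reduction.
            bf16_grad = param.grad.to(torch.bfloat16)
            dist.all_reduce(bf16_grad)
            param.grad.copy_(bf16_grad / world_size)
```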

Collection: NLP

Changelog

  • Debug distopt support for BF16 grad reductions

Usage

Set the optimizer to distributed_fused_adam in the config file and configure it with grad_sync_dtype: bf16, as in the sketch below.
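
A minimal config sketch; the exact nesting and surrounding keys depend on your model config, so treat everything outside the two optimizer settings as illustrative:

```yaml
# Sketch of the relevant optimizer settings in a model config.
model:
  optim:
    name: distributed_fused_adam   # distributed optimizer (distopt)
    grad_sync_dtype: bf16          # reduce grads in BF16; embedding grads are handled in FP32
```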

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

@timmoon10 added the bug (Something isn't working) label on Mar 31, 2023
@github-actions bot added the core (Changes to NeMo Core) and NLP labels on Mar 31, 2023

@ericharper (Collaborator) left a comment:


LGTM. Thanks!

@timmoon10 merged commit 5a99a86 into NVIDIA:main on Mar 31, 2023
timmoon10 added a commit to timmoon10/NeMo that referenced this pull request on Mar 31, 2023:

* Debug distopt support for BF16 grad reductions
* Dump and load FP32 main params
* Style tweaks

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Mikołaj Błaż <[email protected]>
ericharper pushed a commit that referenced this pull request on Apr 3, 2023:

* GPT support for BF16 grad reductions (#5920)
  * Add support for BF16 grad reductions with distopt
  * Fix style issues
  * Update Apex commit
* Add custom functions to launch distopt communication in interleaved pipeline parallelism (#6183)
* Bugfix for BF16 grad reductions with distopt (#6340)
  * Debug distopt support for BF16 grad reductions
  * Dump and load FP32 main params
  * Style tweaks

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Mikołaj Błaż <[email protected]>
Later commits referencing this pull request carry the same three changes (Debug distopt support for BF16 grad reductions; Dump and load FP32 main params; Style tweaks):

* mikolajblaz added a commit to mikolajblaz/NeMo that referenced this pull request on Apr 5, 2023
* mikolajblaz added a second commit to mikolajblaz/NeMo that referenced this pull request on Apr 5, 2023 (additionally Signed-off-by: Mikołaj Błaż <[email protected]>)
* ericharper pushed a commit that referenced this pull request on Apr 5, 2023
* hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request on Jun 2, 2023 (additionally Signed-off-by: hsiehjackson <[email protected]>)