Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for mutex timeout in distributed optimizer backward hook #9087

Merged
merged 5 commits into from
May 2, 2024

Conversation

minitu
Copy link
Contributor

@minitu minitu commented May 2, 2024

What does this PR do ?

Modified version of #9084, to debug a hang at

with self._lock:

Collection: NLP

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable the distributed optimizer with model.optim.name=distributed_fused_adam and set the timeout with model.optim.lock_timeout=<seconds>.

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Sorry, something went wrong.

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
@github-actions github-actions bot added the core Changes to NeMo Core label May 2, 2024
@minitu minitu force-pushed the distopt-mutex-timeout branch from dd78d22 to f36140a Compare May 2, 2024 00:44
@minitu minitu changed the title Distopt mutex timeout Add option for mutex timeout in distributed optimizer backward hook May 2, 2024
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
@minitu minitu force-pushed the distopt-mutex-timeout branch from f36140a to 972d1b6 Compare May 2, 2024 03:19
Jaemin Choi added 2 commits May 1, 2024 20:33
This reverts commit 972d1b6.

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
@minitu minitu marked this pull request as ready for review May 2, 2024 03:56
@minitu
Copy link
Contributor Author

minitu commented May 2, 2024

@ShriyaPalsamudram Tested on Eos single-node, the updated code spits out a RuntimeError when the lock is not obtained within the given timeout period.

@layalir
Copy link
Collaborator

layalir commented May 2, 2024

How were you able to create a timeout situation in one node? Did you set the timer to very very small value?

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
self._lock.release()
else:
# Failed to acquire lock before timeout
print(f'MegatronDistributedFusedAdam: Failed to acquire lock within {lock_timeout} seconds.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use logging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that we want the print to happen even with NeMo logging disabled, for hang debugging purposes. @ShriyaPalsamudram

@ericharper ericharper merged commit b2eccd2 into NVIDIA:r2.0.0.rc0.beta May 2, 2024
4 checks passed
github-actions bot pushed a commit that referenced this pull request May 2, 2024
…9087)

* Tim: Add option for timeout in distopt callback mutex

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

* Replace parent's _lock

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

* Revert "Replace parent's _lock"

This reverts commit 972d1b6.

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

* Raise RuntimeError when timeout

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

* Change RuntimeError to print

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

---------

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
michal2409 added a commit that referenced this pull request Jun 12, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
…9087) (#9091)

* Tim: Add option for timeout in distopt callback mutex



* Replace parent's _lock



* Revert "Replace parent's _lock"

This reverts commit 972d1b6.



* Raise RuntimeError when timeout



* Change RuntimeError to print



---------

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <minitu77@gmail.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
galv pushed a commit to galv/NeMo that referenced this pull request Jun 13, 2024
…VIDIA#9087) (NVIDIA#9091)

* Tim: Add option for timeout in distopt callback mutex



* Replace parent's _lock



* Revert "Replace parent's _lock"

This reverts commit 972d1b6.



* Raise RuntimeError when timeout



* Change RuntimeError to print



---------

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <minitu77@gmail.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
JesusPaz pushed a commit to JesusPaz/NeMo that referenced this pull request Jun 18, 2024

Verified

This commit was signed with the committer’s verified signature.
JesusPaz Jesus Paz
…VIDIA#9087) (NVIDIA#9091)

* Tim: Add option for timeout in distopt callback mutex



* Replace parent's _lock



* Revert "Replace parent's _lock"

This reverts commit 972d1b6.



* Raise RuntimeError when timeout



* Change RuntimeError to print



---------

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <minitu77@gmail.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…VIDIA#9087) (NVIDIA#9091)

* Tim: Add option for timeout in distopt callback mutex



* Replace parent's _lock



* Revert "Replace parent's _lock"

This reverts commit 972d1b6.



* Raise RuntimeError when timeout



* Change RuntimeError to print



---------

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <minitu77@gmail.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
XuesongYang pushed a commit to paarthneekhara/NeMo that referenced this pull request Jan 18, 2025
…VIDIA#9087) (NVIDIA#9091)

* Tim: Add option for timeout in distopt callback mutex



* Replace parent's _lock



* Revert "Replace parent's _lock"

This reverts commit 972d1b6.



* Raise RuntimeError when timeout



* Change RuntimeError to print



---------

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <minitu77@gmail.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Changes to NeMo Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants