Add option for mutex timeout in distributed optimizer backward hook #9087
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
dd78d22 to f36140a
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
f36140a to 972d1b6
This reverts commit 972d1b6. Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
@ShriyaPalsamudram Tested on Eos single-node; the updated code raises a RuntimeError when the lock is not acquired within the given timeout period.
How were you able to create a timeout situation on one node? Did you set the timeout to a very small value?
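For illustration only: one way a timeout can be provoked inside a single process is to hold the lock in another thread and request it with a very small timeout. The sketch below uses a plain `threading.Lock`; the names `holder` and `try_hook` are hypothetical, and this is not necessarily how the Eos test was run.

```python
import threading
import time

lock = threading.Lock()
holding = threading.Event()  # set once the holder thread owns the lock


def holder(hold_seconds: float) -> None:
    """Hold the lock for a while to simulate a stuck callback."""
    with lock:
        holding.set()
        time.sleep(hold_seconds)


def try_hook(lock_timeout: float) -> bool:
    """Return True if the lock was acquired within lock_timeout seconds."""
    acquired = lock.acquire(timeout=lock_timeout)
    if acquired:
        lock.release()
    return acquired


t = threading.Thread(target=holder, args=(0.5,))
t.start()
holding.wait()  # make sure the other thread really holds the lock
timed_out = not try_hook(lock_timeout=0.05)  # far shorter than the hold time
t.join()
acquired_after = try_hook(lock_timeout=0.05)  # lock is free again
```

With the holder keeping the lock for 0.5 seconds, the 0.05-second request fails, and the same request succeeds once the holder thread has finished.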
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
    self._lock.release()
else:
    # Failed to acquire lock before timeout
    print(f'MegatronDistributedFusedAdam: Failed to acquire lock within {lock_timeout} seconds.')
this should use logging
The issue is that we want the print to happen even with NeMo logging disabled, for hang debugging purposes. @ShriyaPalsamudram
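The trade-off in this thread can be sketched as follows. This is a minimal sketch, not the NeMo implementation: `HookSketch`, `run_hook`, the logger name, and the mapping of `None` to a blocking acquire are assumptions made for illustration; only the `lock_timeout` name and the wording of the message come from the diff under review. It shows that a `print` still reaches stdout when Python logging is globally disabled.

```python
import contextlib
import io
import logging
import threading
from typing import Optional

logger = logging.getLogger("hook_sketch")  # hypothetical logger name


class HookSketch:
    """Hypothetical stand-in for an optimizer that guards a callback with a lock."""

    def __init__(self, lock_timeout: Optional[float] = None) -> None:
        self._lock = threading.Lock()
        self.lock_timeout = lock_timeout

    def run_hook(self) -> bool:
        # Assumption: None means "wait forever"; threading.Lock uses -1 for that.
        timeout = -1 if self.lock_timeout is None else self.lock_timeout
        if self._lock.acquire(timeout=timeout):
            try:
                return True  # the real callback work would happen here
            finally:
                self._lock.release()
        # Failed to acquire lock before timeout
        logger.warning("Failed to acquire lock within %s seconds.", timeout)
        print(f"Failed to acquire lock within {timeout} seconds.", flush=True)
        return False


opt = HookSketch(lock_timeout=0.01)
opt._lock.acquire()  # simulate another callback that never releases the lock
logging.disable(logging.CRITICAL)  # stand-in for "logging is disabled"
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    ok = opt.run_hook()
logging.disable(logging.NOTSET)  # restore normal logging
opt._lock.release()
```

Here the `logger.warning` call is suppressed while logging is disabled, but the printed message is still captured, which is the behavior wanted for hang debugging.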
…9087)
* Tim: Add option for timeout in distopt callback mutex
* Replace parent's _lock
* Revert "Replace parent's _lock"
  This reverts commit 972d1b6.
* Raise RuntimeError when timeout
* Change RuntimeError to print
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
…9087) (#9091)
* Tim: Add option for timeout in distopt callback mutex
* Replace parent's _lock
* Revert "Replace parent's _lock"
  This reverts commit 972d1b6.
* Raise RuntimeError when timeout
* Change RuntimeError to print
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Jaemin Choi <minitu77@gmail.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
What does this PR do?
Modified version of #9084, to debug a hang at NeMo/nemo/core/optim/distributed_adam.py, line 131 in f658b6f.
Collection: NLP
Changelog
Usage
Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.
Enable the distributed optimizer with model.optim.name=distributed_fused_adam and set the timeout with model.optim.lock_timeout=<seconds>.
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment "jenkins" on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs in various areas.
Additional Information