fix NCCL error with non-consecutive trainer gpus #8165
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #8165    +/-   ##
=========================================
- Coverage      93%      88%       -5%
=========================================
  Files         211      211
  Lines       13450    13456        +6
=========================================
- Hits        12486    11841      -645
- Misses        964     1615      +651
@awaelchli Should we add a test for this?
@kaushikb11 I can add a test, but it will require > 3 GPUs and will never run on our CI (which only has 2). We could then ask Adrian nicely to manually run this test on every ddp-related PR, what do you think? xD
@awaelchli could we maybe mock the barrier call, just to see whether it is actually called with the device ids instead of just the number?
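For illustration, a minimal sketch of that mocking idea. The helper `ddp_barrier` below is hypothetical and merely stands in for the plugin's barrier call; an actual test would patch the collective and exercise the real plugin instead:

```python
from unittest import mock

import torch.distributed as dist


def ddp_barrier(device_ids):
    """Hypothetical stand-in for the DDP plugin's barrier call."""
    dist.barrier(device_ids=device_ids)


def test_barrier_receives_device_ids():
    # Patch the collective so no process group (or GPU) is required.
    with mock.patch("torch.distributed.barrier") as barrier_mock:
        ddp_barrier(device_ids=[1, 3])
    # Assert the actual device ids were forwarded, not just their count.
    barrier_mock.assert_called_once_with(device_ids=[1, 3])
```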
Wow, great fix! Awesome work @awaelchli :)
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-06-28 16:29:25 UTC
@kaushikb11 @justusschock thanks for your suggestions. I added a test; it's not particularly smart, but it fails on master with an NCCL error and passes here. Again, it will not run in our CI due to the GPU requirements, but I tested it on another server by running the special test directly (on master and on this PR).
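A rough sketch of what such a test could look like, written against the Trainer API of that era. The model, dataset, and exact Trainer arguments are placeholders, and ddp_spawn is used here only because it stays inside a single pytest process; the actual test added in this PR may differ:

```python
import pytest
import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __len__(self):
        return 32

    def __getitem__(self, idx):
        return torch.randn(32)


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


@pytest.mark.skipif(torch.cuda.device_count() < 4, reason="requires at least 4 GPUs")
def test_ddp_with_non_consecutive_gpus(tmpdir):
    # gpus=[1, 3] previously crashed in the barrier with an NCCL error on master.
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=[1, 3],
        accelerator="ddp_spawn",
        fast_dev_run=True,
    )
    trainer.fit(TinyModel(), DataLoader(RandomDataset(), batch_size=4))
```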
Great catch @awaelchli!
I think there's a log line from torch.distributed that says when device_ids is not explicitly passed, it's inferred from the local rank, but I need to double-check this.
@ananthsub This was also my suspicion, and I first tried to set it explicitly. In the torch.distributed docs it is clearly written that it is the local rank by default, and it seems the barrier(device_ids=) needs to match this. However, it is a bit of a puzzle for me why the barrier needs this when other collective calls don't.
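To illustrate the mismatch being discussed (assuming the default really is the local rank): with gpus=[1, 3], each process computes on a different CUDA device than its local rank, so a barrier that falls back to the local rank would target the wrong device.

```python
gpus = [1, 3]  # Trainer(gpus="1,3")

for local_rank in range(len(gpus)):
    inferred_device = local_rank       # what the barrier assumes without device_ids
    actual_device = gpus[local_rank]   # the GPU this rank actually uses
    print(f"rank {local_rank}: barrier default cuda:{inferred_device}, model on cuda:{actual_device}")

# rank 0: barrier default cuda:0, model on cuda:1
# rank 1: barrier default cuda:1, model on cuda:3
```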
* device ids in barrier x x s same fix for spawn fix non-nccl x * add changelog * get nccl backend * get backend Co-authored-by: Kaushik B <[email protected]>
What does this PR do?
Fixes #8139
Can be reproduced on master with the following command:
NCCL_DEBUG=INFO python pl_examples/basic_examples/simple_image_classifier.py --trainer.gpus "1,3" --trainer.accelerator ddp
The issue appeared after #7202, where a barrier was introduced right after init_process_group but before the DDP model gets configured with device_ids. When calling torch.distributed.barrier on a number of devices, it is assumed to affect devices 0, 1, 2, etc. However, when we want to train on gpus=[1,3], for example, that is not the case. The fix is to use the device_ids argument in the barrier call (see the sketch after this description). I do not know why and how this works, don't ask me. But if you know why, let me know.
Partially related (these need to be informed about this fix):
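A sketch of the fix described above, not necessarily the exact code merged in this PR: the device_ids argument is only accepted by the NCCL backend (PyTorch >= 1.8), so a plain barrier is kept as the fallback for other backends.

```python
import torch.distributed as dist


def barrier(device_ids=None):
    """Sketch of a barrier that respects non-consecutive GPU selections.

    `device_ids` would hold the CUDA device used by this process,
    e.g. [3] for global rank 1 when training with gpus=[1, 3].
    """
    if not dist.is_available() or not dist.is_initialized():
        return
    if dist.get_backend() == "nccl" and device_ids is not None:
        # Tell NCCL which device this rank's barrier runs on instead of
        # letting it assume device index == local rank.
        dist.barrier(device_ids=device_ids)
    else:
        dist.barrier()
```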
Before submitting
Note: this cannot really be tested on CI, it needs a machine with more than 2 GPUs!
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃