fix NCCL error with non-consecutive trainer gpus #8165

awaelchli · 2021-06-28T09:14:04Z

What does this PR do?

Can be reproduced on master with the following command:

NCCL_DEBUG=INFO python pl_examples/basic_examples/simple_image_classifier.py --trainer.gpus "1,3"  --trainer.accelerator  ddp

The issue appeared after #7202, where a barrier was introduced right after init_process_group but before the DDP model gets configured with device_ids. When torch.distributed.barrier is called on a number of devices, it is assumed to affect devices 0, 1, 2, etc. However, when we want to train on gpus=[1,3], for example, that is not the case.
The fix is to use the device_ids argument in the barrier call. I do not know why and how this works, don't ask me. But if you know why, let me know.
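
For readers unfamiliar with the argument, here is a minimal sketch of what such a barrier call could look like. It is not the diff from this PR: `barrier_kwargs` and `barrier_on_device` are hypothetical helper names, and the sketch assumes a PyTorch version whose `torch.distributed.barrier` accepts `device_ids` (an argument that is only valid for the NCCL backend).

```python
from typing import Any, Dict, Optional


def barrier_kwargs(backend: str, device_index: Optional[int]) -> Dict[str, Any]:
    """Build keyword arguments for torch.distributed.barrier (hypothetical helper).

    device_ids is only passed for the NCCL backend; other backends get no extra arguments.
    """
    if backend == "nccl" and device_index is not None:
        return {"device_ids": [device_index]}
    return {}


def barrier_on_device(device_index: Optional[int]) -> None:
    """Synchronize all processes, pinning the NCCL barrier to this process's GPU."""
    # Imported lazily so that barrier_kwargs can be used without torch installed.
    import torch.distributed as dist

    if not dist.is_available() or not dist.is_initialized():
        return  # nothing to synchronize outside of a process group
    dist.barrier(**barrier_kwargs(dist.get_backend(), device_index))
```

Under these assumptions, with `--trainer.gpus "1,3"` the process driving GPU 3 would call `barrier_on_device(3)` rather than relying on whatever device the barrier picks by default.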

Partially related issues that need to be informed about this fix:

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Note: this cannot really be tested, it needs a machine with > 2 GPUs!
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-06-28T09:15:40Z

Codecov Report

Merging #8165 (8677cba) into master (c4492ad) will decrease coverage by 5%.
The diff coverage is 80%.

@@           Coverage Diff           @@
##           master   #8165    +/-   ##
=======================================
- Coverage      93%     88%    -5%     
=======================================
  Files         211     211            
  Lines       13450   13456     +6     
=======================================
- Hits        12486   11841   -645     
- Misses        964    1615   +651

pytorch_lightning/plugins/training_type/ddp.py

kaushikb11 · 2021-06-28T13:57:32Z

@awaelchli Should we add a test for this?

awaelchli · 2021-06-28T14:01:01Z

@kaushikb11 I can add a test but it will require >3 gpus and will never run on our CI (which only has 2). We could then ask Adrian nicely to manually run this test on every ddp-related PR, what do you think? xD

justusschock · 2021-06-28T14:03:06Z

@awaelchli could we maybe mock the barrier call just to see if it is actually called with the devices instead of just the number?
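
A rough sketch of what such a mock-based check could look like, using only `unittest.mock`. The function `sync_processes` is a hypothetical stand-in for the plugin's barrier method, and the `dist_module` parameter is an assumption made so the sketch runs without GPUs or an initialized process group; it is not the test that was later added to this PR.

```python
from typing import Any
from unittest import mock


def sync_processes(dist_module: Any, device_index: int) -> None:
    """Hypothetical stand-in for the code under test: a barrier pinned to one device."""
    dist_module.barrier(device_ids=[device_index])


def test_barrier_receives_device_ids() -> None:
    # A MagicMock records every call, so no real collective is executed.
    fake_dist = mock.MagicMock()
    sync_processes(fake_dist, device_index=3)
    # Fails if barrier was called without device_ids or with a different device.
    fake_dist.barrier.assert_called_once_with(device_ids=[3])
```

This only verifies that the argument is forwarded; it cannot reproduce the NCCL error itself, which still needs real non-consecutive GPUs.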

tchaton

Wow, great fix ! Awesome work @awaelchli :)

pep8speaks · 2021-06-28T14:38:15Z

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-28 16:29:25 UTC

awaelchli · 2021-06-28T15:48:04Z

@kaushikb11 @justusschock thanks for your suggestions. I added a test. It's not particularly smart, but it fails on master with the NCCL error and passes here. Again, it will not run in our CI due to the GPU requirements, but I tested on another server by running the special test directly (on master and on this PR).

ananthsub

great catch @awaelchli !
I think there's a log line from torch.distributed that says when device ids are not explicitly passed, they are inferred from the local rank, but I need to double check this.

awaelchli · 2021-06-28T18:19:25Z

@ananthsub This was also my suspicion, and I first tried to set torch.cuda.set_device(local_rank) before the first barrier call without success.

In the torch.distributed docs it is clearly written that it is the local rank by default for the DistributedDataParallel wrapper.

Please ensure that device_ids argument is set to be the only GPU device id that your code will be operating on. This is generally the local rank of the process. In other words, the device_ids needs to be [args.local_rank], and output_device needs to be args.local_rank in order to use this utility.

And it seems the barrier(device_ids=) needs to match this. However, it is a bit of a puzzle to me why the barrier needs this when other collective calls don't.
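
To make the suspected mismatch concrete, here is a small sketch. It assumes, as suspected in this thread but not confirmed here, that a call without explicit device ids falls back to the local rank as the device index. The helper names `assumed_device`, `selected_device` and `ddp_kwargs` are hypothetical; only the `device_ids` and `output_device` keyword names come from the DistributedDataParallel docs quoted above.

```python
from typing import Any, Dict, List


def assumed_device(local_rank: int) -> int:
    """Device index used if the local rank is taken as the device (assumption)."""
    return local_rank


def selected_device(gpus: List[int], local_rank: int) -> int:
    """Device index the user actually asked for, e.g. gpus=[1, 3]."""
    return gpus[local_rank]


def ddp_kwargs(gpus: List[int], local_rank: int) -> Dict[str, Any]:
    """Keyword arguments one would pass to DistributedDataParallel for this process."""
    index = selected_device(gpus, local_rank)
    return {"device_ids": [index], "output_device": index}


gpus = [1, 3]
# Local ranks for which the assumed device differs from the selected one.
mismatched = [r for r in range(len(gpus)) if assumed_device(r) != selected_device(gpus, r)]
```

For consecutive GPUs starting at 0 the two mappings agree, which would explain why the problem only shows up with selections such as gpus=[1,3].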

* device ids in barrier x x s same fix for spawn fix non-nccl x
* add changelog
* get nccl backend
* get backend

Co-authored-by: Kaushik B <[email protected]>

awaelchli added bug Something isn't working distributed Generic distributed-related topic priority: 0 High priority task labels Jun 28, 2021

awaelchli added this to the v1.3.x milestone Jun 28, 2021

awaelchli changed the title ~~Bugfix/ddp barrier~~ fix NCCL error with non-consecutive trainer gpus Jun 28, 2021

awaelchli added 2 commits June 28, 2021 12:07

device ids in barrier

d40385e

x x s same fix for spawn fix non-nccl x

add changelog

6979e50

awaelchli force-pushed the bugfix/ddp-barrier branch from de03f74 to 6979e50 Compare June 28, 2021 10:20

awaelchli commented Jun 28, 2021

pytorch_lightning/plugins/training_type/ddp.py (outdated, resolved)

justusschock approved these changes Jun 28, 2021

awaelchli force-pushed the bugfix/ddp-barrier branch 4 times, most recently from b055365 to 6979e50 Compare June 28, 2021 13:15

awaelchli added 3 commits June 28, 2021 15:17

get nccl backend

2fdee97

get backend

7283b38

Merge branch 'master' into bugfix/ddp-barrier

4e91001

awaelchli marked this pull request as ready for review June 28, 2021 13:55

awaelchli requested review from Borda, carmocca, kaushikb11, SeanNaren, tchaton and williamFalcon as code owners June 28, 2021 13:55

mergify bot added the has conflicts label Jun 28, 2021

Merge branch 'master' into bugfix/ddp-barrier

b2366b3

kaushikb11 approved these changes Jun 28, 2021

mergify bot removed the has conflicts label Jun 28, 2021

tchaton approved these changes Jun 28, 2021

tchaton enabled auto-merge (squash) June 28, 2021 14:03

add test

7a913d8

awaelchli and others added 3 commits June 28, 2021 16:39

test

6974299

[pre-commit.ci] auto fixes from pre-commit.com hooks

1439be1

for more information, see https://pre-commit.ci

Merge remote-tracking branch 'origin/bugfix/ddp-barrier' into bugfix/…

df167ab

…ddp-barrier

awaelchli force-pushed the bugfix/ddp-barrier branch from 0d9284e to 0845383 Compare June 28, 2021 15:45

This was referenced Jun 28, 2021

CUDA OOM when initializing DDP #4705

Closed

CUDA OOM when using "ddp" mode in training #7817

Closed

update ddp test

8677cba

awaelchli force-pushed the bugfix/ddp-barrier branch from 0845383 to 8677cba Compare June 28, 2021 16:29

ananthsub approved these changes Jun 28, 2021

tchaton merged commit bf54ac1 into master Jun 28, 2021

tchaton deleted the bugfix/ddp-barrier branch June 28, 2021 20:08

awaelchli mentioned this pull request Jun 29, 2021

Clean cuda.empty_cache usage #8199

Merged

7 tasks

awaelchli mentioned this pull request Jul 1, 2021

torch.save with ddp accelerator throwing RuntimeError: Tensors must be CUDA and dense #8227

Closed

Adel-Moumen mentioned this pull request Sep 15, 2024

Critical: Fix DDP barriers speechbrain/speechbrain#2686

Merged

13 tasks
