'RuntimeError: No rendezvous handler for env://' with multi-gpu #5358

costantinoai · 2021-01-05T12:01:57Z

🐛 Bug

I get an error
'RuntimeError: No rendezvous handler for env://'
when I run my model on multiple GPUs.

Below the code and the traceback:

trainer = pl.Trainer(gpus=-1,
                     accelerator='ddp',
                     check_val_every_n_epoch=10,
                     # precision=16,
                     # auto_scale_batch_size='binsearch',
                     callbacks=[checkpoint_callback],
                     max_epochs=1)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

trainer.fit(model)

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):

File "", line 1, in
trainer.fit(model)

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 470, in fit
results = self.accelerator_backend.train()

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\accelerators\ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\accelerators\ddp_accelerator.py", line 252, in ddp_train
self.init_ddp_connection(

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 153, in init_ddp_connection
self.ddp_plugin.init_ddp_connection(

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\plugins\ddp_plugin.py", line 90, in init_ddp_connection
torch_distrib.init_process_group(

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\torch\distributed\distributed_c10d.py", line 433, in init_process_group
rendezvous_iterator = rendezvous(

File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))

RuntimeError: No rendezvous handler for env://

The error does not occur if I set

gpus = 1

Expected behavior

Environment

PyTorch Version (e.g., 1.0): 1.7.1
OS (e.g., Linux): Windows 10
How you installed PyTorch (conda, pip, source): conda
Build command you used (if compiling from source): conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
Python version: 3.8.5
CUDA/cuDNN version: 11.0
GPU models and configuration: 2 * Quadro RTX 6000
Any other relevant information:

github-actions · 2021-01-05T12:02:38Z

Hi! Thanks for your contribution! Great first issue!

costantinoai · 2021-01-05T12:05:10Z

Also, I don't know if it is related, but when I check GPU usage during training (with gpus = 1) in the Windows Task Manager, I see only 1-2% GPU utilization and 45-50% CPU utilization. Is this normal behaviour?

Borda · 2021-01-06T08:09:50Z

@costantinoai mind sharing which PL version you are using? Also, do you have a full example to reproduce?

costantinoai · 2021-01-06T08:16:57Z

Hi @Borda ,
Thanks for your reply.

PL version is 1.1.2.

I do have an example of the full code on colab, but I would rather not post it publicly.

How can I share it with you?

awaelchli · 2021-01-07T00:37:20Z

Hi, you can ping me on slack if you want. It's probably an issue with passing the argument gpus=-1 to the subprocess script. I bet if you set gpus=n where n is the number of gpus, it will work. We just have to support -1 for ddp.
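The suggestion above (pass an explicit GPU count instead of -1) can be sketched as follows. This is only an illustration of that first guess, not the eventual fix: `resolve_gpu_count` is a hypothetical helper, and the available count is passed in by hand here, although in a real script it could come from `torch.cuda.device_count()`.

```python
def resolve_gpu_count(requested, available):
    """Turn a `gpus` request into an explicit, validated device count.

    requested: the value intended for Trainer(gpus=...); -1 means "all".
    available: number of visible devices, e.g. from torch.cuda.device_count().
    """
    if requested == -1:
        return available
    if requested < 0 or requested > available:
        raise ValueError(
            f"requested {requested} GPUs, but {available} are available"
        )
    return requested


# With the two GPUs from the report, -1 resolves to an explicit 2.
num_gpus = resolve_gpu_count(-1, available=2)
print(num_gpus)
```

The resolved value would then be passed as `gpus=num_gpus` when constructing the Trainer.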

costantinoai · 2021-01-07T01:11:33Z

Ok, thanks. I’ll try setting n and see what happens. I’ll send you the Colab link on slack if I still have issues. Really appreciated!

costantinoai · 2021-01-07T10:11:46Z

@awaelchli I still get the same problem after setting gpus = 2. I reached out to you on Twitter (I don't have a Slack account).

Thanks!

awaelchli · 2021-01-07T11:36:02Z

In summary, after a private conversation with @costantinoai:

ddp is not supported on the Windows platform (yet)
the script needs a guard around its entry point (if __name__ == "__main__")

If these requirements are not met, we see 'No rendezvous handler for env://' or similar exceptions.
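A minimal sketch of the entry-point guard mentioned in the summary. The names `build_trainer_kwargs` and `main` are hypothetical, the argument values mirror the original report, and the Lightning calls are left as comments so the snippet runs without any dependencies.

```python
def build_trainer_kwargs(num_gpus, accelerator):
    """Collect the Trainer arguments in one place (values mirror the report)."""
    return {
        "gpus": num_gpus,
        "accelerator": accelerator,
        "check_val_every_n_epoch": 10,
        "max_epochs": 1,
    }


def main():
    kwargs = build_trainer_kwargs(num_gpus=2, accelerator="dp")
    print("Trainer arguments:", kwargs)
    # In a real script (assumes pytorch_lightning and a LightningModule `model`):
    # import pytorch_lightning as pl
    # trainer = pl.Trainer(**kwargs)
    # trainer.fit(model)


if __name__ == "__main__":
    # Only the process that was started directly runs main(). A process that
    # merely imports this file (as child processes created with the "spawn"
    # start method do) skips it instead of re-running the training code.
    main()
```

Keeping all training code inside `main()` means that importing the file has no side effects, which is what the guard relies on.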

BlockWaving · 2021-01-28T06:14:32Z

Hi, I also get this error after adding a second GPU to my machine:

RuntimeError: No rendezvous handler for env://

Please advise how to fix it or work around it.
Thanks!

awaelchli · 2021-01-31T21:05:41Z

RuntimeError: No rendezvous handler for env://

That's not much information, but one possibility is that you are on Windows.
accelerator='ddp' will not work on Windows; you have to choose 'dp'.
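The workaround above can be written so the same script runs on both Windows and Linux. This is a sketch under the assumption that 'dp' and 'ddp' are the only two choices of interest, as in the Lightning versions discussed in this thread; `pick_accelerator` is a hypothetical helper, not part of Lightning.

```python
import sys


def pick_accelerator(platform=sys.platform):
    """Return 'dp' on Windows, where 'ddp' is reported not to work, else 'ddp'."""
    return "dp" if platform.startswith("win") else "ddp"


accelerator = pick_accelerator()
print("Using accelerator:", accelerator)
# The value would then be passed on, e.g. pl.Trainer(gpus=2, accelerator=accelerator)
```

Taking the platform string as a parameter keeps the choice easy to check without a Windows machine.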

mdja · 2021-02-03T04:39:44Z

I am on Windows and saw this error. Changing the accelerator to 'dp' works.

DavidRimel · 2021-03-02T23:02:00Z

I am on Windows and saw this error. Changing the accelerator to 'dp' works.

Windows 10 user here. This worked for me.

carlomarxdk · 2021-03-04T10:56:26Z

I am on PyTorch Lightning 1.2.1 and I still run into this issue on Windows when I set the accelerator to "dp". I am training on 1 GPU.
I encounter this issue when I use the DeepSpeed plugin.

awaelchli · 2021-03-04T11:15:24Z

@carlomarxdk is DeepSpeed supported on Windows? I can't find any mention of it, so probably not.

ibrahimishag · 2021-03-09T09:05:38Z

I ran into this issue on Windows 10.

ibrahimishag · 2021-03-10T08:40:20Z

Changing the accelerator to dp on Windows 10 as suggested by @awaelchli and @mdja solved my issue.
Thank you.

costantinoai added bug Something isn't working help wanted Open to be worked on labels Jan 5, 2021

Borda assigned awaelchli Jan 6, 2021

Borda added the priority: 1 Medium priority task label Jan 6, 2021

awaelchli mentioned this issue Jan 7, 2021

Update notes on ddp_spawn accelerator in multi-gpu docs #5402

Merged

12 tasks

awaelchli closed this as completed in #5402 Jan 13, 2021

agolynski mentioned this issue Mar 9, 2021

RuntimeError: No rendezvous handler for env:// pytorch/pytorch#53135

Closed

easonnie mentioned this issue Mar 10, 2021

RuntimeError: No rendezvous handler for env:// facebookresearch/anli#20

Closed

ZHEQIUSHUI mentioned this issue Apr 20, 2021

work in windows platform RangiLyu/nanodet#222

Closed

thetushar006 mentioned this issue Jun 14, 2021

Request help to resolve "No rendezvous handler" error on Windows 10 microsoft/Swin-Transformer#77

Closed

nightlessbaron mentioned this issue Sep 16, 2021

[Feat] Add support for distributed training learnables/learn2learn#257

Open

4 tasks

CA4GitHub mentioned this issue Nov 6, 2021

Runtime Error: No rendezvous handler for env:// facebookresearch/dino#151

Closed
