I got the same error while training. I am using pytorch_lightning 1.3.8 and torch 1.9.0+cu102.
It seems to happen at random, so I am not sure how to reproduce it.
Here's the PyTorch Lightning error message:
```
Traceback (most recent call last):
  File "train.py", line 203, in <module>
    trainer.fit(model)
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 909, in run_train
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/sid/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 383, in reconciliate_processes
    torch.save(True, os.path.join(sync_dir, f"{self.global_rank}.pl"))
  File "/home/sid/miniconda3/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
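The last frame is the telling one: `sync_dir` is `None` when `reconciliate_processes` builds the checkpoint path, and `os.path.join` refuses a `None` first argument. A minimal sketch reproducing just that `TypeError` (standalone, not Lightning code):

```python
import os

sync_dir = None  # what reconciliate_processes ends up with when the sync dir was never set
os.path.join(sync_dir, "0.pl")
# TypeError: expected str, bytes or os.PathLike object, not NoneType
```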
Can we re-open this issue, or get some guidance on how to make it reproducible? I see it at random (a re-run does not reproduce it) with DDP on SLURM. Too many workers?
🐛 Bug
I am summarizing the source of the issue to speed up the fix.
After this line of code
https://github.com/PyTorchLightning/pytorch-lightning/blob/90929fa4333e5136020e9f9dcb7c1133e4c290f3/pytorch_lightning/accelerators/ddp_backend.py#L119
I have that `env_copy['PL_GLOBAL_SEED']` is `None`, and having an environment variable set to `None` breaks `subprocess.Popen` here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/90929fa4333e5136020e9f9dcb7c1133e4c290f3/pytorch_lightning/accelerators/ddp_backend.py#L127
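To illustrate that second point outside of Lightning (the variable name here just mirrors the description above): `subprocess.Popen` raises a `TypeError` as soon as any value in its `env` mapping is `None`:

```python
import os
import subprocess

env_copy = os.environ.copy()
env_copy["PL_GLOBAL_SEED"] = None  # the state described above

# Popen must encode every env value for the child process;
# a None value makes that encoding step raise a TypeError.
subprocess.Popen(["python", "-c", "pass"], env=env_copy)
```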
My fix at the moment is to add a guard against this (see the sketch below) after
https://github.com/PyTorchLightning/pytorch-lightning/blob/90929fa4333e5136020e9f9dcb7c1133e4c290f3/pytorch_lightning/accelerators/ddp_backend.py#L119
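The exact snippet is not shown above, so the following is only a plausible sketch of such a guard: drop the key when the seed was never exported, so the child environment never contains a `None`:

```python
# Hypothetical guard, reconstructed from the description above, not the
# author's verbatim fix: ensure PL_GLOBAL_SEED is never None in the
# environment handed to subprocess.Popen.
if env_copy.get("PL_GLOBAL_SEED") is None:
    env_copy.pop("PL_GLOBAL_SEED", None)
```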
Environment