DDP fails with DDPPlugin and num_nodes>1 (with SLURM) #7429
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), environment: slurm, help wanted (Open to be worked on), priority: 1 (Medium priority task)
🐛 Bug
`DDPPlugin` crashes my training scripts when training models on multiple nodes (using SLURM).
When I use multiple GPUs on 1 node with the plugin -> all gucci.
When I use multiple GPUs on multiple nodes without the plugin -> all gucci.
When I use multiple GPUs on multiple nodes with the plugin -> crashes 😢
So when I run...
Code fails with...
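For context, the shape of the setup involved is roughly the following. This is an illustrative sketch only, not the actual `debug.py` (which is elided above); it assumes the pre-2.0 PyTorch Lightning API where `DDPPlugin` lives in `pytorch_lightning.plugins` and `Trainer` accepts `gpus`/`num_nodes`/`accelerator`, and the GPU counts are placeholder values.

```python
# Illustrative sketch -- not the report's actual debug.py.
# Assumes the pre-2.0 Lightning API (Trainer(gpus=..., num_nodes=..., accelerator="ddp")).
import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin

trainer = pl.Trainer(
    gpus=2,                                            # GPUs per node (assumed value)
    num_nodes=2,                                       # num_nodes > 1 is the failing case
    accelerator="ddp",
    plugins=DDPPlugin(find_unused_parameters=False),   # plugin that triggers the crash
)
```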
To Reproduce
debug.py:
submit.sh:
Commands
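A minimal SLURM submission of this shape would look something like the script below. This is a hedged sketch, not the report's actual `submit.sh`; the node/GPU counts and time limit are assumptions. One point worth noting: Lightning's SLURM integration expects one task per GPU, so `--ntasks-per-node` should match the number of GPUs requested per node.

```shell
#!/bin/bash
# Illustrative SLURM script -- not the report's actual submit.sh.
# Lightning's SLURM detection expects one task per GPU per node,
# so --ntasks-per-node matches --gres=gpu count here.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --time=00:30:00

srun python debug.py
```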
Expected behaviour
The code shouldn't fail..? :D Multi-node training with `DDPPlugin` should work the same as it does without the plugin.
Environment