The machines I use for multi-node training do not allow the ssh service.
How can I launch multi-node training using only the basic torchrun command (torch.distributed.launch)?
The servers I use do not have slurm, and I found that both openmpi and pdsh rely on ssh.
So I have run out of the options listed in this repo's README for starting a multi-node training job.
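
For reference, here is a rough sketch of the kind of ssh-free launch being asked about: torchrun is started by hand on every node, each with its own `--node_rank`, and the nodes rendezvous over TCP at a shared master address. The flags are standard torchrun options, but the script name `train.py`, the address `10.0.0.1`, the port, and the node and GPU counts are placeholder assumptions. Whether this repo's training script works when launched this way is not confirmed here.

```python
import shlex


def torchrun_command(node_rank, nnodes, nproc_per_node, master_addr,
                     master_port=29500, script="train.py", script_args=()):
    """Build the torchrun argv to run manually on one node (no ssh involved).

    `script` and `script_args` are hypothetical placeholders for the
    repo's actual training entry point and its arguments.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--nproc_per_node={nproc_per_node}",  # processes (usually GPUs) per machine
        f"--node_rank={node_rank}",            # 0 on the master node, 1..N-1 elsewhere
        f"--master_addr={master_addr}",        # address of the rank-0 node, reachable by all nodes
        f"--master_port={master_port}",        # free TCP port on the rank-0 node
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Assumed setup: 2 nodes with 8 GPUs each; print the command for each node.
    for rank in range(2):
        print(shlex.join(torchrun_command(rank, 2, 8, "10.0.0.1")))
```

Each printed command would be run in a shell on the corresponding node. The only requirement between nodes is that the master address and port are reachable over the network, not that ssh is available.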