[ddp] Support multi-node distributed execution under torchelastic #1811
Conversation
pytorch_lightning/core/lightning.py
log.warning("WORLD_SIZE environment variable is not equal to the computed "
            "world size. Ignored.")
if 'WORLD_SIZE' in os.environ and int(os.environ['WORLD_SIZE']) != world_size:
    log.warning("WORLD_SIZE environment variable ({}) is not equal to the computed "
Can you please use Python format string literals (f"") here instead?
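For reference, a minimal sketch of the f-string form (assuming `log` and `world_size` exist in the surrounding module):

```python
# Illustrative f-string version of the warning in the diff above;
# `log` and `world_size` are assumed from the surrounding code.
import os

if 'WORLD_SIZE' in os.environ and int(os.environ['WORLD_SIZE']) != world_size:
    log.warning(f"WORLD_SIZE environment variable ({os.environ['WORLD_SIZE']}) is not equal "
                f"to the computed world size ({world_size}). Ignored.")
```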
# otherwise use given node rank or default to node rank 0
try:
    node_id = os.environ['SLURM_NODEID'] if self.is_slurm_managing_tasks else os.environ['NODE_RANK']
node_keys = ['SLURM_NODEID', 'NODE_RANK', 'GROUP_RANK']
I'm concerned about cases when multiple of these are set. Can we log a warning if we use one of these IDs and the other two are set? I think writing the code that way would make it easier to understand what's happening here too.
good idea.
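A minimal sketch of that suggestion (names are illustrative, not the exact Lightning implementation): collect every candidate variable that is set, warn when more than one is defined, and fall back to rank 0.

```python
import os
import logging

log = logging.getLogger(__name__)

def determine_node_rank() -> int:
    # candidate variables, checked in priority order
    node_keys = ['SLURM_NODEID', 'NODE_RANK', 'GROUP_RANK']
    node_ids = [(k, os.environ[k]) for k in node_keys if k in os.environ]
    if len(node_ids) == 0:
        log.warning("No environment variable for node rank defined. Set as 0.")
        return 0
    if len(node_ids) > 1:
        log.warning(f"Multiple environment variables ({[k for k, _ in node_ids]}) "
                    f"defined for node rank. Using the first one in priority order: "
                    f"{node_ids[0][0]}")
    return int(node_ids[0][1])
```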
elif self.use_ddp:
    if self.is_slurm_managing_tasks:
        task = int(os.environ['SLURM_LOCALID'])
        self.ddp_train(task, model)
    # torchelastic
    elif 'WORLD_SIZE' in os.environ and 'GROUP_RANK' in os.environ:
I wonder what we need to do to get this working on Slurm. It could be as simple as using the LOCAL_RANK environment variable instead of SLURM_LOCALID.
I'll look into it.
Okay, so I looked into it and it's not that simple. Basically, much like distributed training, there are a few ways to initialize elastic training. However, because elastic training needs to own the processes to work, Slurm can't spawn them for it.
In distributed training you have the options:
- run `python -m torch.distributed.launch <train_script.py>`, which creates the processes for you
- start the processes yourself using mp (e.g. in the else of this if statement)
- let Slurm (or another scheduler) create the processes
In elastic training you have the options:
- run `python3 -m torchelastic.distributed.launch`, which creates the processes and handles the fault tolerance and elastic workers
- create an elastic agent such as LocalElasticAgent. This will spawn the elastic processes and manage them with the synchronous function LocalElasticAgent.run().
This doesn't leave a particularly easy way to do Slurm, because the agent needs to spawn the processes. My guess is that you need to configure Slurm to have 1 process per node (i.e. ntasks-per-node=1) and then create the agent and processes as explained in option 2 at the beginning of training. You'd also need to set up the distributed key-value store backend (Etcd or Zeus). Luckily they've provided a helpful Python API for spawning an Etcd server.
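Following up on that guess, a hypothetical per-node sketch: it assumes ntasks-per-node=1, an Etcd server already reachable at etcd-host:2379, and simply hands process management to the torchelastic launcher (which internally creates a LocalElasticAgent). The script name and endpoint are placeholders.

```python
# Hypothetical per-node entry point under Slurm (ntasks-per-node=1).
# Each Slurm task starts the torchelastic launcher, which spawns and
# supervises the actual worker processes on that node.
import os
import subprocess

num_nodes = int(os.environ["SLURM_JOB_NUM_NODES"])
job_id = os.environ["SLURM_JOB_ID"]

cmd = [
    "python", "-m", "torchelastic.distributed.launch",
    f"--nnodes={num_nodes}",
    "--nproc_per_node=4",               # e.g. one worker per GPU on the node
    "--rdzv_backend=etcd",
    "--rdzv_endpoint=etcd-host:2379",   # assumed shared Etcd endpoint
    f"--rdzv_id={job_id}",              # all nodes of this job join the same rendezvous
    "train_script.py",                  # placeholder training script
]
subprocess.run(cmd, check=True)
```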
@tullie I don't know what you mean. Lightning already works correctly under a Slurm managed task environment. Do you mean having the same code for both pytorch elastic and slurm?
Yes, elastic launches agents on each node which manage the individual worker processes. Lightning's job in that case is to init its process group and configure and run a single trainer worker. This is just like Slurm.
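To make that concrete, a rough sketch (not the actual Lightning code) of what a single elastic-managed worker does, relying on the environment variables the agent sets (LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT):

```python
# Sketch of a single torchelastic-managed worker: the agent has already set
# the rank/world-size/master environment variables, so the worker only joins
# the process group and pins itself to its local GPU before training.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# init_method="env://" reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
dist.init_process_group(backend="nccl", init_method="env://")
```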
All I was saying is that this PR doesn't add support for PyTorch Elastic in a Slurm-managed environment. That's fine for now, but ideally they'd be able to work together in the future.
pytorch_lightning/trainer/trainer.py
self.configure_slurm_ddp(self.num_nodes)
self.node_rank = self.determine_node_rank()
seems like this function is not defined?
@williamFalcon it is! In the distributed_data_parallel.py mixin.
Ah, typo. Correcting.
Oh cool, I see. I guess the names don't match:
determine_ddp_node_rank -> determine_node_rank
Here's the suggested change:
- self.node_rank = self.determine_node_rank()
+ self.node_rank = self.determine_ddp_node_rank()
@Borda last PR then let's do rc2
LGTM
    log.warning("No environment variable for node rank defined. Set as 0.")
    return 0
if len(node_ids) > 1:
    log.warning(f"Multiple environment variables ({keys(node_ids)} defined for node rank. "
"keys" is undefined here
./pytorch_lightning/trainer/distrib_data_parallel.py:295: [F821] undefined name 'keys'
The changes are quite local and limited in nature -- viz., checking for some indicator environment variables. We check for (SLURM_LOCALID, NODE_RANK, GROUP_RANK) in order. If more than one is set, a warning is logged. This patch also fixes a minor bug when comparing against the `WORLD_SIZE` environment variable, which can be a string.
(cherry picked from commit aefc531)
What does this PR do?
It allows using Lightning distributed trainers where workers are managed by torchelastic (https://github.com/pytorch/elastic).
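As a rough usage sketch (Trainer arguments are illustrative of the 0.7.x-era API, and MyLightningModule is a placeholder), a script like the one below would be started on each node with `python -m torchelastic.distributed.launch ... train.py`, and the launcher supplies the rank and world-size environment variables that Lightning reads:

```python
# train.py -- hypothetical multi-node script run under the torchelastic launcher.
import pytorch_lightning as pl

from my_project import MyLightningModule  # placeholder user model

model = MyLightningModule()
trainer = pl.Trainer(
    gpus=4,                        # workers per node, matching --nproc_per_node
    num_nodes=2,                   # matching --nnodes
    distributed_backend="ddp",     # plain DDP; torchelastic owns the processes
)
trainer.fit(model)
```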