[ddp] Support multi-node distributed execution under torchelastic #1811
Conversation
pytorch_lightning/core/lightning.py
log.warning("WORLD_SIZE environment variable is not equal to the computed "
            "world size. Ignored.")
if 'WORLD_SIZE' in os.environ and int(os.environ['WORLD_SIZE']) != world_size:
    log.warning("WORLD_SIZE environment variable ({}) is not equal to the computed "
Can you please use Python format string literals (f"") here instead?
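For reference, a minimal sketch of the f-string form (assuming `log` and `world_size` exist in the surrounding module):

```python
# Illustrative f-string version of the warning in the diff above;
# `log` and `world_size` are assumed from the surrounding code.
import os

if 'WORLD_SIZE' in os.environ and int(os.environ['WORLD_SIZE']) != world_size:
    log.warning(f"WORLD_SIZE environment variable ({os.environ['WORLD_SIZE']}) is not equal "
                f"to the computed world size ({world_size}). Ignored.")
```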
# otherwise use given node rank or default to node rank 0
try:
    node_id = os.environ['SLURM_NODEID'] if self.is_slurm_managing_tasks else os.environ['NODE_RANK']
node_keys = ['SLURM_NODEID', 'NODE_RANK', 'GROUP_RANK']
I'm concerned about cases when multiple of these are set. Can we log a warning if we use one of these IDs and the other two are set? I think writing the code that way would make it easier to understand what's happening here too.
good idea.
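A minimal sketch of that suggestion (names are illustrative, not the exact Lightning implementation): collect every candidate variable that is set, warn when more than one is defined, and fall back to rank 0.

```python
import os
import logging

log = logging.getLogger(__name__)

def determine_node_rank() -> int:
    # candidate variables, checked in priority order
    node_keys = ['SLURM_NODEID', 'NODE_RANK', 'GROUP_RANK']
    node_ids = [(k, os.environ[k]) for k in node_keys if k in os.environ]
    if len(node_ids) == 0:
        log.warning("No environment variable for node rank defined. Set as 0.")
        return 0
    if len(node_ids) > 1:
        log.warning(f"Multiple environment variables ({[k for k, _ in node_ids]}) "
                    f"defined for node rank. Using the first one in priority order: "
                    f"{node_ids[0][0]}")
    return int(node_ids[0][1])
```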
elif self.use_ddp:
    if self.is_slurm_managing_tasks:
        task = int(os.environ['SLURM_LOCALID'])
        self.ddp_train(task, model)
    # torchelastic
    elif 'WORLD_SIZE' in os.environ and 'GROUP_RANK' in os.environ:
I wonder what we need to do to get this working on Slurm. It could be as simple as using the LOCAL_RANK environment variable instead of SLURM_LOCALID.
I'll look into it.
Okay, so I looked into it and it's not that simple. Basically, much like distributed training, there are a few ways to initialize elastic training. However, because elastic training needs to own the processes to work, Slurm can't spawn them for it.
In distributed training you have the options:
- run `python -m torch.distributed.launch <train_script.py>`, which creates the processes for you
- start the processes yourself using mp (e.g. in the else of this if statement)
- let Slurm (or another scheduler) create the processes
In elastic training you have the options:
- run `python3 -m torchelastic.distributed.launch`, which creates the processes and handles the fault tolerance and elastic workers
- create an elastic agent such as LocalElasticAgent. This will spawn the elastic processes and manage them with the synchronous function LocalElasticAgent.run().
This doesn't leave a particularly easy way to do Slurm, because the agent needs to spawn the processes. My guess is that you need to configure Slurm to have 1 process per node (i.e. ntasks-per-node=1) and then create the agent and processes as explained in option 2 at the beginning of training. You'd also need to set up the distributed key-value store backend (Etcd or Zeus). Luckily they've provided a helpful Python API for spawning an Etcd server.
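Following up on that guess, a hypothetical per-node sketch: it assumes ntasks-per-node=1, an Etcd server already reachable at etcd-host:2379, and simply hands process management to the torchelastic launcher (which internally creates a LocalElasticAgent). The script name and endpoint are placeholders.

```python
# Hypothetical per-node entry point under Slurm (ntasks-per-node=1).
# Each Slurm task starts the torchelastic launcher, which spawns and
# supervises the actual worker processes on that node.
import os
import subprocess

num_nodes = int(os.environ["SLURM_JOB_NUM_NODES"])
job_id = os.environ["SLURM_JOB_ID"]

cmd = [
    "python", "-m", "torchelastic.distributed.launch",
    f"--nnodes={num_nodes}",
    "--nproc_per_node=4",               # e.g. one worker per GPU on the node
    "--rdzv_backend=etcd",
    "--rdzv_endpoint=etcd-host:2379",   # assumed shared Etcd endpoint
    f"--rdzv_id={job_id}",              # all nodes of this job join the same rendezvous
    "train_script.py",                  # placeholder training script
]
subprocess.run(cmd, check=True)
```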
@tullie I don't know what you mean. Lightning already works correctly under a Slurm managed task environment. Do you mean having the same code for both pytorch elastic and slurm?
Yes, elastic launches agents on each node which manage the individual worker processes. Lightning's job in that case is to init its process group and configure and run a single trainer worker. This is just like Slurm.
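To make that concrete, a rough sketch (not the actual Lightning code) of what a single elastic-managed worker does, relying on the environment variables the agent sets (LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT):

```python
# Sketch of a single torchelastic-managed worker: the agent has already set
# the rank/world-size/master environment variables, so the worker only joins
# the process group and pins itself to its local GPU before training.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# init_method="env://" reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
dist.init_process_group(backend="nccl", init_method="env://")
```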
All I was saying is that this PR doesn't add support for PyTorch Elastic in a Slurm-managed environment. That's fine for now, but ideally they'd be able to work together in the future.
pytorch_lightning/trainer/trainer.py
self.configure_slurm_ddp(self.num_nodes)
self.node_rank = self.determine_node_rank()
seems like this function is not defined?
@williamFalcon it is! In the distributed_data_parallel.py mixin.
Ah, typo. Correcting.
Oh cool, I see. I guess the names don't match:
determine_ddp_node_rank -> determine_node_rank
Here's the suggested change:
- self.node_rank = self.determine_node_rank()
+ self.node_rank = self.determine_ddp_node_rank()
@Borda last PR then let's do rc2
LGTM
    log.warning("No environment variable for node rank defined. Set as 0.")
    return 0
if len(node_ids) > 1:
    log.warning(f"Multiple environment variables ({keys(node_ids)} defined for node rank. "
"keys" is undefined here
./pytorch_lightning/trainer/distrib_data_parallel.py:295: [F821] undefined name 'keys'
The changes are quite local and limited in nature -- viz., checking for some indicator environment variables. We check for (SLURM_LOCALID, NODE_RANK, GROUP_RANK) in order. If more than one is set, a warning is logged. This patch also fixes a minor bug when comparing against the `WORLD_SIZE` environment variable, which can be a string.
(cherry picked from commit aefc531)
What does this PR do?
It allows using Lightning distributed trainers where workers are managed by torchelastic (https://github.com/pytorch/elastic).
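As a rough usage sketch (Trainer arguments are illustrative of the 0.7.x-era API, and MyLightningModule is a placeholder), a script like the one below would be started on each node with `python -m torchelastic.distributed.launch ... train.py`, and the launcher supplies the rank and world-size environment variables that Lightning reads:

```python
# train.py -- hypothetical multi-node script run under the torchelastic launcher.
import pytorch_lightning as pl

from my_project import MyLightningModule  # placeholder user model

model = MyLightningModule()
trainer = pl.Trainer(
    gpus=4,                        # workers per node, matching --nproc_per_node
    num_nodes=2,                   # matching --nnodes
    distributed_backend="ddp",     # plain DDP; torchelastic owns the processes
)
trainer.fit(model)
```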