ref: separate slurm from ddp #3809

williamFalcon · 2020-10-02T23:29:03Z

ref: separate slurm from ddp

mergify · 2020-10-03T01:01:01Z

This pull request is now in conflict... :(

ananthsub · 2020-10-03T01:33:01Z

pytorch_lightning/accelerators/ddp_slurm_backend.py

+    def broadcast(self, obj, src=0):
+        return self.dist.broadcast(obj)


why's src unused?

williamFalcon · 2020-10-03

This overrides the parent method in Accelerator; the signature stays the same to keep behavior consistent across accelerators.

src is already set on this object automatically, so it does not need to be passed through.
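A minimal sketch of the pattern described in this reply, for readers following the thread. It is not the code from the PR: `BaseAccelerator`, `PreconfiguredDist`, and `SlurmLikeBackend` are hypothetical stand-ins, and the helper only imitates a collective by returning the object unchanged. The point is that the override keeps the `src` parameter so callers can treat every accelerator the same way, while the delegate already holds the source rank.

```python
class BaseAccelerator:
    """Hypothetical stand-in for the parent class; only the signature matters."""

    def broadcast(self, obj, src=0):
        # Single-process default: nothing to send, return the object unchanged.
        return obj


class PreconfiguredDist:
    """Hypothetical helper that is told its source rank once, at construction."""

    def __init__(self, src=0):
        self.src = src

    def broadcast(self, obj):
        # A real helper would send obj from rank self.src to all other ranks.
        # This stand-in just returns it so the example runs in one process.
        return obj


class SlurmLikeBackend(BaseAccelerator):
    """Hypothetical backend that overrides broadcast with the same signature."""

    def __init__(self, dist):
        self.dist = dist

    def broadcast(self, obj, src=0):
        # src is accepted to match BaseAccelerator.broadcast, but it is not
        # forwarded: the helper already knows which rank is the source.
        return self.dist.broadcast(obj)


backend = SlurmLikeBackend(PreconfiguredDist(src=0))
result = backend.broadcast({"lr": 0.1}, src=0)
```

One consequence of this design is that a caller passing a non-default `src` would have it silently ignored, which is presumably what prompted the review question.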

pep8speaks · 2020-10-03T02:46:25Z

Hello @williamFalcon! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-03 02:47:46 UTC

ananthsub · 2020-10-03T02:54:29Z

pytorch_lightning/accelerators/ddp_slurm_backend.py

+# -------------------------------------------
+# !!!!!!!!!!!!!! NOTE !!!!!!!!!!!!!!!!!!!!!!
+# TEMP CLASS WHILE WE DECOUPLE SLURM FROM DDP
+# !!!!!!!!!!!!!! NOTE !!!!!!!!!!!!!!!!!!!!!!
+# -------------------------------------------
+class DDPSLURMBackend(Accelerator):


what's the final envisioned setup for slurm + DDP? is torchelastic being split out too?

williamFalcon · 2020-10-03

TE (torchelastic) is already split out :)

Right now the goal is to finish fixing DDP; having three types of DDP tied to the same file is making debugging impossible.

Once those tests pass, I'll fix the TE CPU DDP.

Then we'll see where we are on time, but I'd like to make something like a cluster_manager that is passed into the trainer and then linked to a backend.
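One way the cluster_manager idea might look, purely as an illustration. Nothing below exists in the PR: `ClusterManager`, `SLURMCluster`, `TorchElasticCluster`, and `DDPBackend` are hypothetical names, and the sketch assumes the only job of a cluster manager is to report rank and world size. The environment variable names are the ones SLURM (`SLURM_NTASKS`, `SLURM_PROCID`) and torchelastic (`WORLD_SIZE`, `RANK`) set for each process.

```python
import os


class ClusterManager:
    """Hypothetical interface: tells a backend where this process sits."""

    def __init__(self, environ=None):
        # Allow a plain dict to be injected so the class is easy to test.
        self.environ = os.environ if environ is None else environ

    def world_size(self):
        raise NotImplementedError

    def global_rank(self):
        raise NotImplementedError


class SLURMCluster(ClusterManager):
    def world_size(self):
        return int(self.environ["SLURM_NTASKS"])

    def global_rank(self):
        return int(self.environ["SLURM_PROCID"])


class TorchElasticCluster(ClusterManager):
    def world_size(self):
        return int(self.environ["WORLD_SIZE"])

    def global_rank(self):
        return int(self.environ["RANK"])


class DDPBackend:
    """Hypothetical single DDP backend that is linked to a cluster manager."""

    def __init__(self, cluster):
        self.cluster = cluster

    def describe(self):
        return f"rank {self.cluster.global_rank()} of {self.cluster.world_size()}"


slurm_backend = DDPBackend(SLURMCluster({"SLURM_NTASKS": "4", "SLURM_PROCID": "1"}))
summary = slurm_backend.describe()
```

Under this assumption the backend itself would no longer need a SLURM-specific or torchelastic-specific subclass; only the object passed in would differ.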

ref: separate slurm from ddp

dc17935

mergify bot requested a review from a team October 2, 2020 23:29

ref: separate te from ddp

ad15126

ananthsub reviewed Oct 3, 2020

williamFalcon added 2 commits October 2, 2020 22:37

ref: merge

b2b3b80

ref: merge

72638c9

williamFalcon added 2 commits October 2, 2020 22:47

ref: merge

441a709

ref: merge

052b3cc

ananthsub reviewed Oct 3, 2020

ananthsub approved these changes Oct 3, 2020

williamFalcon merged commit a677833 into master Oct 3, 2020

williamFalcon deleted the fw2_10 branch October 4, 2020 17:37

Borda added the refactor label Oct 4, 2020

Borda added this to the 0.10.0 milestone Oct 7, 2020
