
ref: separate slurm from ddp #3809

Merged: 6 commits merged into master on Oct 3, 2020

Conversation

williamFalcon (Contributor)

ref: separate slurm from ddp

@mergify mergify bot requested a review from a team October 2, 2020 23:29
mergify bot (Contributor) commented Oct 3, 2020

This pull request is now in conflict... :(

Comment on lines +98 to +99
def broadcast(self, obj, src=0):
return self.dist.broadcast(obj)
Contributor:
why's src unused?

williamFalcon (Contributor, Author):
This is an override of the parent method in Accelerator; it keeps behavior consistent across accelerators.

src is already set on this object automatically.
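A minimal sketch of why the override can ignore src: assume the distributed helper is constructed with the source rank already baked in, so callers of broadcast cannot diverge from it. The names below (Dist, SLURMBackend) are illustrative, not the actual Lightning internals.

    class Dist:
        def __init__(self, rank: int = 0):
            # the source rank is fixed once, at setup time
            self.rank = rank

        def broadcast(self, obj):
            # broadcast `obj` from process self.rank to all processes
            ...

    class SLURMBackend:
        def __init__(self):
            self.dist = Dist(rank=0)

        def broadcast(self, obj, src=0):
            # `src` is accepted only to match the Accelerator signature;
            # the effective source rank lives on self.dist
            return self.dist.broadcast(obj)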

pep8speaks commented Oct 3, 2020

Hello @williamFalcon! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-03 02:47:46 UTC

Comment on lines +36 to +41
# -------------------------------------------
# !!!!!!!!!!!!!! NOTE !!!!!!!!!!!!!!!!!!!!!!
# TEMP CLASS WHILE WE DECOUPLE SLURM FROM DDP
# !!!!!!!!!!!!!! NOTE !!!!!!!!!!!!!!!!!!!!!!
# -------------------------------------------
class DDPSLURMBackend(Accelerator):
ananthsub (Contributor) commented Oct 3, 2020

What's the final envisioned setup for SLURM + DDP? Is torchelastic being split out too?

williamFalcon (Contributor, Author):

TE (torchelastic) is already split out :)

Right now the goal is to finish fixing DDP, and having three types of DDP tied to the same file makes debugging impossible.

Once those tests pass, I'll fix the TE CPU DDP.

Then we'll see where we are with time, but I'd like to add something like a cluster_manager that is passed into the Trainer as well and then linked to a backend. A rough sketch of that idea follows.
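This is a hypothetical sketch of the cluster_manager idea, not code from the PR: environment-specific details (rank discovery, world size, master address) live in a pluggable object, and the DDP backend stays agnostic. All class names here (ClusterManager, SLURMCluster, DDPBackend) are assumptions for illustration; the SLURM environment variables are real.

    import os

    class ClusterManager:
        def master_address(self) -> str: ...
        def world_size(self) -> int: ...
        def global_rank(self) -> int: ...

    class SLURMCluster(ClusterManager):
        def master_address(self) -> str:
            # simplification: treat the first entry of the node list as
            # the master (real SLURM node lists may be compressed, e.g.
            # "node[1-4]", and need expanding)
            return os.environ["SLURM_NODELIST"].split(" ")[0]

        def world_size(self) -> int:
            return int(os.environ["SLURM_NTASKS"])

        def global_rank(self) -> int:
            return int(os.environ["SLURM_PROCID"])

    class DDPBackend:
        def __init__(self, cluster: ClusterManager):
            # the backend is linked to whichever cluster manager the
            # Trainer was given, instead of hard-coding SLURM logic
            self.cluster = cluster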

@williamFalcon williamFalcon merged commit a677833 into master Oct 3, 2020
@williamFalcon williamFalcon deleted the fw2_10 branch October 4, 2020 17:37
@Borda Borda added the refactor label Oct 4, 2020
@Borda Borda added this to the 0.10.0 milestone Oct 7, 2020