Consolidate collective functions #7534

Open
ananthsub opened this issue May 13, 2021 · 8 comments
Labels
distributed (Generic distributed-related topic), feature (Is an improvement or enhancement), let's do it! (approved to implement), refactor
Milestone

Comments

@ananthsub
Contributor

ananthsub commented May 13, 2021

🚀 Feature

Lightning should offer a central place to use the collective functions provided here: https://pytorch.org/docs/stable/distributed.html#collective-functions

Motivation

LightningModule code is usually agnostic to what device it's running on or whether it's running in a distributed training environment. However, there are times when the module does need to rely on collective functions.

In Lightning, these collectives are currently offered in many places.

Some of these implementations call each other, and the dependencies between them aren't clear, so it is confusing for users to know which one to use.

Pitch

  1. Offer these utilities in a central place, pytorch_lightning/utilities/collectives.py:
    barrier, all_gather, broadcast, etc.

These should be very thin wrappers over the PyTorch distributed functions that check whether torch.distributed is available and initialized. If not, they return what is expected for single-process training (see the sketch after this list).

  2. Update the internal callsites to use these implementations.

  3. Mark the existing functions as deprecated and slated for removal in v1.6.
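To make the proposed wrappers concrete, here is a minimal sketch assuming the module path from the pitch; the exact signatures and fallback values are assumptions, not a final API:

```python
# pytorch_lightning/utilities/collectives.py -- hypothetical module from the pitch
from typing import Any, List

import torch
import torch.distributed


def distributed_available() -> bool:
    """True only if torch.distributed can actually be used right now."""
    return torch.distributed.is_available() and torch.distributed.is_initialized()


def barrier() -> None:
    # Single-process training: nothing to synchronize.
    if distributed_available():
        torch.distributed.barrier()


def broadcast(obj: Any, src: int = 0) -> Any:
    # Single-process training: the local object is already the "broadcast" result.
    if not distributed_available():
        return obj
    obj_list = [obj]
    torch.distributed.broadcast_object_list(obj_list, src=src)
    return obj_list[0]


def all_gather(tensor: torch.Tensor) -> List[torch.Tensor]:
    # Single-process training: a world of size 1 gathers only the local tensor.
    if not distributed_available():
        return [tensor]
    gathered = [torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(gathered, tensor)
    return gathered
```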

cc @Borda @awaelchli @rohitgr7 @akihironitta @justusschock

@ananthsub ananthsub added the feature, help wanted, and refactor labels May 13, 2021
@ananthsub ananthsub added this to the v1.4 milestone May 13, 2021
@ananthsub ananthsub self-assigned this May 13, 2021
@ananthsub ananthsub changed the title Offer collective functions under Lightning utilities Consolidate collective functions May 14, 2021
@edenlightning edenlightning modified the milestones: v1.4, v1.5 Jul 6, 2021
@bowangbj

bowangbj commented Aug 10, 2021

Thanks Ananth and others. Based on the inputs, we came up with a mini-proposal: https://docs.google.com/document/d/1e83FcZHHHsTmiUmNpgmPTBYjugZaC3pcZhd4VV9AuIM/edit

Looking forward to your feedback.

@ananthsub
Contributor Author

@tchaton @carmocca @awaelchli @justusschock please take a look!

@justusschock
Member

@ananthsub @bowangbj I left some comments and in general I really like the idea :)

@awaelchli
Contributor

The Collective class is a nice addition. I like it as well!

@bowangbj

Thanks Justus, Adrian and all for the super quick review.
Addressed most of the comments in the doc.
Starting implementation ...

@tchaton
Contributor

tchaton commented Aug 20, 2021

Hey @ananthsub, I definitely like the idea.

However, would the collectives be Trainer-aware? I don't think we could support collectives for TPU / Horovod / etc. without knowing which accelerator has been selected.

Best,
T.C

@ananthsub
Contributor Author

ananthsub commented Aug 20, 2021

@tchaton - the idea is that the collectives would sit in the training type plugin for control purposes. This way, the plugin selects the collective implementation (see the sketch below). It's very similar to how the checkpoint IO plugin is a super simple interface owned by the training type plugin. Lightning can offer default implementations for torch distributed, XLA, and Horovod, or users could provide their own implementation in a custom training type plugin.

This way, we at least wrangle all of the different collectives used inside of the trainer.
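For concreteness, a rough sketch of such a plugin-owned interface, following the CheckpointIO analogy above (the class and method names here are hypothetical, not a final API):

```python
# Hypothetical sketch: a Collective interface owned by the training type plugin,
# mirroring how CheckpointIO is a small interface the plugin owns.
from abc import ABC, abstractmethod
from typing import Any, List, Optional

import torch


class Collective(ABC):
    """Backend-specific collectives hidden behind one small interface."""

    @abstractmethod
    def barrier(self) -> None:
        ...

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        ...

    @abstractmethod
    def all_gather(self, tensor: torch.Tensor) -> List[torch.Tensor]:
        ...


class SingleProcessCollective(Collective):
    """Fallback when no distributed backend is running: every op is a no-op
    or returns the local value, matching single-process expectations."""

    def barrier(self) -> None:
        pass

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        return obj

    def all_gather(self, tensor: torch.Tensor) -> List[torch.Tensor]:
        return [tensor]


class MyTrainingTypePlugin:
    """Each training type plugin selects its collective implementation;
    Lightning could ship torch.distributed-, XLA-, and Horovod-backed
    subclasses, and users can pass their own in a custom plugin."""

    def __init__(self, collective: Optional[Collective] = None) -> None:
        self.collective = collective or SingleProcessCollective()
```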

Regarding the utility functions or whether/how this is exposed to users, I think that will be a secondary phase of this once we clean up the trainer side. Many of the utilities Lightning offers today are specific to torch distributed, so even now we have uneven coverage for TPU and Horovod. I don't think this collective class should try to cover all of those because the backend implementations can be really different, and there's no guarantee that all backends support all of these collectives.

We're also discussing with the torch distributed side whether a registration mechanism would allow using XLA with the same torch distributed APIs, at which point this becomes much easier to manage.

@tchaton
Contributor

tchaton commented Aug 20, 2021

Thanks for the clarification. That makes sense, as it would reduce the number of hooks on the TrainingTypePlugin and decrease its high-level responsibilities.
