Consolidate collective functions #7534
Comments
Thanks Ananth and others. Based on the inputs, we came up with a mini-proposal: https://docs.google.com/document/d/1e83FcZHHHsTmiUmNpgmPTBYjugZaC3pcZhd4VV9AuIM/edit. Looking forward to your input.
@tchaton @carmocca @awaelchli @justusschock please take a look!
@ananthsub @bowangbj I left some comments and in general I really like the idea :)
The Collective class is a nice addition. I like it as well!
Thanks Justus, Adrian and all for the super quick review.
Hey @ananthsub, I definitely like the idea. However, would the collective be Trainer-aware? I don't think we could support collectives for TPU / Horovod / etc. without knowing which accelerator has been selected. Best,
@tchaton - the idea is that the collectives would sit in the training type plugin for control purposes. This way, the plugin selects the collective implementation. It's very similar to how the checkpoint IO plugin is a super simple interface owned by the training type plugin. Lightning can offer default implementations for torch distributed, XLA, and Horovod, or users could provide their own implementation in their own custom training type plugin. This way, we at least wrangle all of the different collectives used inside of the trainer.

Regarding the utility functions, or whether/how this is exposed to users, I think that will be a secondary phase once we clean up the trainer side. Many of the utilities Lightning offers today are specific to torch distributed, so even now we have uneven coverage for TPU and Horovod. I don't think this collective class should try to cover all of those, because the backend implementations can be really different and there's no guarantee that all backends support all of these collectives. We're also discussing with the torch distributed side whether a registration mechanism would allow using XLA with the same torch distributed APIs, at which point this becomes much easier to manage.
Thanks for the clarification. That makes sense, as it would reduce the number of hooks on the TrainingTypePlugin and decrease its high-level responsibilities.
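As a rough illustration of the design discussed above, here is a minimal sketch of a plugin-owned collective interface. All names here (`Collective`, `TorchDistributedCollective`, `CustomTrainingTypePlugin`) are hypothetical and are not taken from the linked mini-proposal; the sketch only shows the shape of the idea, namely that the training type plugin selects and owns the backend-specific implementation, much like it owns the checkpoint IO plugin.

```python
# Hypothetical sketch (names are illustrative, not from the mini-proposal).
from typing import Any, Optional

import torch
import torch.distributed


class Collective:
    """Backend-agnostic interface for collective operations."""

    def barrier(self, name: Optional[str] = None) -> None:
        raise NotImplementedError

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        raise NotImplementedError

    def all_gather(self, tensor: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class TorchDistributedCollective(Collective):
    """Implementation for torch.distributed-based plugins (DDP and friends)."""

    def barrier(self, name: Optional[str] = None) -> None:
        # No-op when torch.distributed isn't initialized (single-process training).
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            torch.distributed.barrier()

    # broadcast and all_gather would wrap torch.distributed in the same thin way
    # (see the utilities sketch under the Pitch section below).


class CustomTrainingTypePlugin:
    """The plugin owns the collective and picks the implementation for its backend."""

    def __init__(self, collective: Optional[Collective] = None) -> None:
        # A TPU plugin would default to an XLA-backed Collective instead.
        self.collective = collective or TorchDistributedCollective()
```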
🚀 Feature
Lightning should offer a central place to use the collective functions provided here: https://pytorch.org/docs/stable/distributed.html#collective-functions
Motivation
LightningModule code is usually agnostic to what device it's running on or whether it's running in a distributed training environment. However, there are times when the module does need to rely on collective functions.
In Lightning, we currently have many places where these are offered:
- On this distributed object, which only supports `broadcast`: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/distributed/dist.py
- `reduce`, `barrier`, `broadcast`, `all_gather`, and `reduce_boolean_decision` are on the trainer's accelerator and training type plugin:
  https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/accelerators/accelerator.py#L431-L455
  https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/plugins/training_type/training_type_plugin.py#L78-L103
- More utilities for gathering tensors, `all_gather`, and `sync_ddp` here: https://github.com/PyTorchLightning/pytorch-lightning/blob/b9a52fa2ef31f12f6992ece18a033318ec551907/pytorch_lightning/utilities/distributed.py#L86-L217
- `all_gather` is repeated again on the LightningModule, calling the trainer's accelerator functions: https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/core/lightning.py#L506-L532

Some of these call each other and the dependency isn't very clear right now, so it is confusing for users which one to go through.
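For illustration only (not part of the original issue text), here is roughly how the same gather can be reached through several of the entry points listed above. This assumes the Lightning 1.3-era API at the commits linked above; the helper name `all_gather_ddp_if_available` and the exact attribute paths may differ across versions.

```python
# Illustrative only: three overlapping ways to all_gather from inside a LightningModule.
import torch
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.distributed import all_gather_ddp_if_available


class GatherExample(LightningModule):
    def validation_step(self, batch, batch_idx):
        score = torch.tensor(1.0, device=self.device)

        # 1. Through the LightningModule helper, which forwards to the accelerator:
        gathered_a = self.all_gather(score)

        # 2. Through the trainer's accelerator / training type plugin:
        gathered_b = self.trainer.accelerator.all_gather(score)

        # 3. Through the standalone utility function:
        gathered_c = all_gather_ddp_if_available(score)
```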
Pitch
- Create `pytorch_lightning/utilities/collectives.py` for these utilities: `barrier`, `all_gather`, `broadcast`, etc. These should be very thin wrappers over the PyTorch distributed functions, checking whether torch.distributed is available and initialized; if not, we return what's expected for single-process training (a sketch follows this list).
- Update the callsites internally to use these implementations.
- Mark the existing functions as deprecated and slated for removal in v1.6.
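A minimal sketch of what such a module could look like, assuming plain torch.distributed. Only the module path and the function names come from the pitch; the bodies and single-process fallbacks below are one possible interpretation, not the actual implementation.

```python
# Sketch of the proposed pytorch_lightning/utilities/collectives.py (assumed bodies).
from typing import Any, List

import torch
import torch.distributed as dist


def _distributed_available() -> bool:
    return dist.is_available() and dist.is_initialized()


def barrier() -> None:
    # No-op when torch.distributed isn't initialized (single-process training).
    if _distributed_available():
        dist.barrier()


def broadcast(obj: Any, src: int = 0) -> Any:
    # In single-process training the object is already "broadcast" to every rank.
    if not _distributed_available():
        return obj
    obj_list = [obj]
    dist.broadcast_object_list(obj_list, src=src)
    return obj_list[0]


def all_gather(tensor: torch.Tensor) -> List[torch.Tensor]:
    # In single-process training the gathered result is just this rank's tensor.
    if not _distributed_available():
        return [tensor]
    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return gathered
```

Centralizing the single-process fallback in one module is what would let LightningModule and trainer code call these functions without checking the backend themselves.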
cc @Borda @awaelchli @rohitgr7 @akihironitta @justusschock