Consolidate collective functions #7534
Comments
Thanks Ananth and others. Based on the inputs, we came up with a mini-proposal: https://docs.google.com/document/d/1e83FcZHHHsTmiUmNpgmPTBYjugZaC3pcZhd4VV9AuIM/edit. Looking forward to your input.
@tchaton @carmocca @awaelchli @justusschock please take a look!
@ananthsub @bowangbj I left some comments and in general I really like the idea :)
The Collective class is a nice addition. I like it as well!
Thanks Justus, Adrian and all for the super quick review.
Hey @ananthsub, I definitely like the idea. However, would the collective be Trainer-aware? I don't think we could support collectives for TPU / Horovod / etc. without knowing which accelerator has been selected. Best,
@tchaton - the idea is that the collectives would sit in the training type plugin for control purposes. This way, the plugin selects the collective implementation. It's very similar to how the checkpoint IO plugin is a super simple interface owned by the training type plugin. Lightning can offer default implementations for torch distributed, XLA, and Horovod, or users could provide their own implementation in their own custom training type plugin. This way, we at least wrangle all of the different collectives used inside of the trainer.

Regarding the utility functions, or whether/how this is exposed to users, I think that will be a secondary phase once we clean up the trainer side. Many of the utilities Lightning offers today are specific to torch distributed, so even now we have uneven coverage for TPU and Horovod. I don't think this collective class should try to cover all of those, because the backend implementations can be really different and there's no guarantee that all backends support all of these collectives. We're also discussing with the torch distributed side whether a registration mechanism would allow using XLA with the same torch distributed APIs, at which point this becomes much easier to manage.
Thanks for the clarification. That makes sense, as it would reduce the number of hooks on the TrainingTypePlugin and decrease its high-level responsibilities.
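As a rough illustration of the design discussed above, here is a minimal sketch of a plugin-owned collective interface. All names here (`Collective`, `TorchDistributedCollective`, `CustomTrainingTypePlugin`) are hypothetical and are not taken from the linked mini-proposal; the sketch only shows the shape of the idea, namely that the training type plugin selects and owns the backend-specific implementation, much like it owns the checkpoint IO plugin.

```python
# Hypothetical sketch (names are illustrative, not from the mini-proposal).
from typing import Any, Optional

import torch
import torch.distributed


class Collective:
    """Backend-agnostic interface for collective operations."""

    def barrier(self, name: Optional[str] = None) -> None:
        raise NotImplementedError

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        raise NotImplementedError

    def all_gather(self, tensor: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class TorchDistributedCollective(Collective):
    """Implementation for torch.distributed-based plugins (DDP and friends)."""

    def barrier(self, name: Optional[str] = None) -> None:
        # No-op when torch.distributed isn't initialized (single-process training).
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            torch.distributed.barrier()

    # broadcast and all_gather would wrap torch.distributed in the same thin way
    # (see the utilities sketch under the Pitch section below).


class CustomTrainingTypePlugin:
    """The plugin owns the collective and picks the implementation for its backend."""

    def __init__(self, collective: Optional[Collective] = None) -> None:
        # A TPU plugin would default to an XLA-backed Collective instead.
        self.collective = collective or TorchDistributedCollective()
```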
🚀 Feature
Lightning should offer a central place to use the collective functions provided here: https://pytorch.org/docs/stable/distributed.html#collective-functions
Motivation
LightningModule code is usually agnostic to what device it's running on or whether it's running in a distributed training environment. However, there are times when the module does need to rely on collective functions.
In Lightning, we currently have many places where these are offered:
- On this distributed object, which only supports `broadcast`: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/distributed/dist.py
- `reduce`, `barrier`, `broadcast`, `all_gather`, and `reduce_boolean_decision` are on the trainer's accelerator and training type plugin:
  https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/accelerators/accelerator.py#L431-L455
  https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/plugins/training_type/training_type_plugin.py#L78-L103
- More utilities for gathering tensors, `all_gather`, and `sync_ddp` here: https://github.com/PyTorchLightning/pytorch-lightning/blob/b9a52fa2ef31f12f6992ece18a033318ec551907/pytorch_lightning/utilities/distributed.py#L86-L217
- `all_gather` is repeated again on the LightningModule, calling the trainer's accelerator functions: https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/core/lightning.py#L506-L532

Some of these call each other and the dependency isn't very clear right now, so it is confusing for users which one to go through.
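For illustration only (not part of the original issue text), here is roughly how the same gather can be reached through several of the entry points listed above. This assumes the Lightning 1.3-era API at the commits linked above; the helper name `all_gather_ddp_if_available` and the exact attribute paths may differ across versions.

```python
# Illustrative only: three overlapping ways to all_gather from inside a LightningModule.
import torch
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.distributed import all_gather_ddp_if_available


class GatherExample(LightningModule):
    def validation_step(self, batch, batch_idx):
        score = torch.tensor(1.0, device=self.device)

        # 1. Through the LightningModule helper, which forwards to the accelerator:
        gathered_a = self.all_gather(score)

        # 2. Through the trainer's accelerator / training type plugin:
        gathered_b = self.trainer.accelerator.all_gather(score)

        # 3. Through the standalone utility function:
        gathered_c = all_gather_ddp_if_available(score)
```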
Pitch
- Create `pytorch_lightning/utilities/collectives.py` for these utilities: `barrier`, `all_gather`, `broadcast`, etc. These should be very thin wrappers over the PyTorch distributed functions, checking whether torch.distributed is available and initialized; if not, we return what's expected for single-process training (a sketch follows this list).
- Update the callsites internally to use these implementations.
- Mark the existing functions as deprecated and slated for removal in v1.6.
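A minimal sketch of what such a module could look like, assuming plain torch.distributed. Only the module path and the function names come from the pitch; the bodies and single-process fallbacks below are one possible interpretation, not the actual implementation.

```python
# Sketch of the proposed pytorch_lightning/utilities/collectives.py (assumed bodies).
from typing import Any, List

import torch
import torch.distributed as dist


def _distributed_available() -> bool:
    return dist.is_available() and dist.is_initialized()


def barrier() -> None:
    # No-op when torch.distributed isn't initialized (single-process training).
    if _distributed_available():
        dist.barrier()


def broadcast(obj: Any, src: int = 0) -> Any:
    # In single-process training the object is already "broadcast" to every rank.
    if not _distributed_available():
        return obj
    obj_list = [obj]
    dist.broadcast_object_list(obj_list, src=src)
    return obj_list[0]


def all_gather(tensor: torch.Tensor) -> List[torch.Tensor]:
    # In single-process training the gathered result is just this rank's tensor.
    if not _distributed_available():
        return [tensor]
    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return gathered
```

Centralizing the single-process fallback in one module is what would let LightningModule and trainer code call these functions without checking the backend themselves.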
cc @Borda @awaelchli @rohitgr7 @akihironitta @justusschock