Add model summary when using DeepSpeed Stage 3 #13427

SeanNaren · 2022-06-28T09:18:01Z

What does this PR do?

Introduces a `DeepSpeedModelSummary` that reports the actual (unpartitioned) size of each tensor and also shows the size of the partitions created by DeepSpeed. Previously the parameter counts were shown as "0" because DeepSpeed modifies the parameters in place. Now you get something like this:

  | Name  | Type                       | Params | Params per Device
-------------------------------------------------------------------------
0 | ptlm  | T5ForConditionalGeneration | 737 M  | 184 M
1 | layer | Linear                     | 66     | 17
-------------------------------------------------------------------------
737 M    Trainable params
0        Non-trainable params
737 M    Total params
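
To illustrate where the two columns come from, here is a minimal sketch in plain Python. It is not the code from this PR: `FakeParam`, `partition_size` and `summarize` are hypothetical names, the world size of 4 is inferred from the ratio in the table above, and the 64 + 2 split of the `Linear` layer is an assumed shape that adds up to 66. It also assumes each parameter is padded up to a multiple of the world size before being split, which is what the rounded-up "17" suggests.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class FakeParam:
    """Hypothetical stand-in for a parameter under ZeRO Stage 3 (not a DeepSpeed class)."""

    local_numel: int  # what a naive numel() count sees once the tensor has been partitioned away
    full_numel: int  # the full, unpartitioned element count


def partition_size(full_numel: int, world_size: int) -> int:
    """Elements held per device, assuming padding up to a multiple of world_size."""
    return math.ceil(full_numel / world_size)


def summarize(params: Iterable[FakeParam], world_size: int) -> Tuple[int, int, int]:
    """Return (naive count, full count, per-device count) for a group of parameters."""
    params = list(params)
    naive = sum(p.local_numel for p in params)
    total = sum(p.full_numel for p in params)
    per_device = sum(partition_size(p.full_numel, world_size) for p in params)
    return naive, total, per_device


# Assumed layout for the 66-parameter Linear layer: a 64-element weight and a 2-element bias.
linear = [FakeParam(local_numel=0, full_numel=64), FakeParam(local_numel=0, full_numel=2)]
print(summarize(linear, world_size=4))  # (0, 66, 17)
```

The first value is the "0" the old summary reported, the second matches the "Params" column, and the third matches "Params per Device" for the `layer` row.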

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @awaelchli @ananthsub @rohitgr7 @SeanNaren @akihironitta

carmocca

Nice addition!

tests/tests_pytorch/callbacks/test_deepspeed_model_summary.py

src/pytorch_lightning/callbacks/deepspeed_model_summary.py

rohitgr7

Also fixes: #10291?

src/pytorch_lightning/callbacks/deepspeed_model_summary.py

SeanNaren · 2022-06-28T14:41:05Z

> Also fixes: #10291?

This doesn't handle FSDP right now; we can add that afterwards!

src/pytorch_lightning/callbacks/model_summary.py

awaelchli

Sorry, I had the review done yesterday but didn't submit it :(

awaelchli · 2022-06-28T18:29:29Z

src/pytorch_lightning/CHANGELOG.md

@@ -280,6 +280,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed `estimated_stepping_batches` requiring distributed comms in `configure_optimizers` for the `DeepSpeedStrategy` ([#13350](https://github.com/PyTorchLightning/pytorch-lightning/pull/13350))


+- Fixed Model Summary when using DeepSpeed Stage 3 ([#13427](https://github.com/PyTorchLightning/pytorch-lightning/pull/13427))


I suggest either

Suggested change

-- Fixed Model Summary when using DeepSpeed Stage 3 ([#13427](https://github.com/PyTorchLightning/pytorch-lightning/pull/13427))
+- Fixed model summary when using DeepSpeed Stage 3 ([#13427](https://github.com/PyTorchLightning/pytorch-lightning/pull/13427))

or

Suggested change

-- Fixed Model Summary when using DeepSpeed Stage 3 ([#13427](https://github.com/PyTorchLightning/pytorch-lightning/pull/13427))
+- Fixed ModelSummary callback when using DeepSpeed Stage 3 ([#13427](https://github.com/PyTorchLightning/pytorch-lightning/pull/13427))

awaelchli · 2022-06-29T15:20:38Z

src/pytorch_lightning/utilities/deepspeed_model_summary.py

@@ -0,0 +1,94 @@
+#!/usr/bin/env python


For organisation, it would be nice if this were called `model_summary_deepspeed` or grouped under a `model_summary` folder.

awaelchli · 2022-06-29T15:22:35Z

tests/tests_pytorch/callbacks/test_deepspeed_model_summary.py

+
+
+@RunIf(min_cuda_gpus=2, deepspeed=True, standalone=True)
+def test_deepspeed_summary(tmpdir):


This test is very expensive for just testing a model summary.
Is there no way we can just test the summary on a model directly without full training and launching processes?

SeanNaren added 2 commits June 28, 2022 09:58

Add summary capabilities when using DeepSpeed Stage 3

6affcc6

Remove prints

4443c3b

SeanNaren added the `bug` and `strategy: deepspeed` labels Jun 28, 2022

SeanNaren added this to the pl:1.6.x milestone Jun 28, 2022

SeanNaren self-assigned this Jun 28, 2022

SeanNaren requested review from tchaton, carmocca, Borda, williamFalcon, kaushikb11, awaelchli, justusschock and rohitgr7 as code owners June 28, 2022 09:18

SeanNaren added 4 commits June 28, 2022 10:18

Fix layer set

5ab0275

Add CHANGELOG.md

6e1fda6

Fix type

f23406e

Ignore override

8991b0f

carmocca approved these changes Jun 28, 2022

tests/tests_pytorch/callbacks/test_deepspeed_model_summary.py (resolved)

tests/tests_pytorch/callbacks/test_deepspeed_model_summary.py (outdated, resolved)

src/pytorch_lightning/callbacks/deepspeed_model_summary.py (outdated, resolved)

carmocca added the `feature` and `callback` labels and removed the `bug` label Jun 28, 2022

SeanNaren added 2 commits June 28, 2022 13:57

Address review

c0d3578

Remove type

de8e762

SeanNaren modified the milestones: pl:1.6.x, pl:1.7 Jun 28, 2022

SeanNaren added 2 commits June 28, 2022 14:09

Fix types

317a1ac

Typing fixes

9dd4d99

justusschock approved these changes Jun 28, 2022

src/pytorch_lightning/callbacks/deepspeed_model_summary.py (outdated, resolved)

mergify bot added the `ready` label Jun 28, 2022

rohitgr7 approved these changes Jun 28, 2022

src/pytorch_lightning/callbacks/deepspeed_model_summary.py (outdated, resolved)

Combine into a single callback

3619d76

rohitgr7 reviewed Jun 28, 2022

src/pytorch_lightning/callbacks/model_summary.py (resolved)

src/pytorch_lightning/callbacks/model_summary.py (resolved)

SeanNaren added 2 commits June 28, 2022 17:15

Merge branch 'master' into feat/deepspeed_summary

e691a29

Merge branch 'master' into feat/deepspeed_summary

6692eee

Borda approved these changes Jun 29, 2022

Merge branch 'master' into feat/deepspeed_summary

2fcda44

SeanNaren enabled auto-merge (squash) June 29, 2022 14:08

SeanNaren merged commit f145acd into master Jun 29, 2022

SeanNaren deleted the feat/deepspeed_summary branch June 29, 2022 14:49

awaelchli reviewed Jun 29, 2022

awaelchli mentioned this pull request Jul 1, 2022

Move deepspeed summary test to correct folder #13478

Merged

11 tasks

jerome-habana pushed a commit to jerome-habana/lightning that referenced this pull request Jul 14, 2022

Add model summary when using DeepSpeed Stage 3 (Lightning-AI#13427)

2954714

awaelchli mentioned this pull request Jul 27, 2022

Organize model summary utilities #13893

Merged

11 tasks
