Wraps sharded model for proper access to its `state_dict` in `FSDP` strategy #16558

SpirinEgor · 2023-01-30T13:12:03Z

What does this PR do?

Fixes #16526 by following the approach of the previously deleted DDPFullyShardedStrategy.

Does your PR introduce any breaking changes?

No, it doesn't.

carmocca

Thanks for looking into this!

This is only an issue in master, not 1.9, correct? In that case, this doesn't need a CHANGELOG entry.

Can you write a test?

src/pytorch_lightning/strategies/fsdp.py

SpirinEgor · 2023-01-30T13:53:42Z

> This is only an issue in master, not 1.9, correct?

No, I ran into this in 1.9 with the fsdp_native strategy. Master renames it to fsdp but does not carry over the fixes from the removed ddp_fully_sharded strategy.

SpirinEgor · 2023-01-31T11:41:38Z

@carmocca I wrote tests for this. I also noticed that the problem actually occurs only for layers that are not wrapped, e.g. small layers. So I parametrized the tests with different wrapping policies and slightly changed BoringModel so that the asserts are correct.

My build has one failed test, but it does not seem to be caused by my changes. What should I do?

And do you have any suggestions about modifying _LightningModuleWrapperBase?
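To illustrate why small layers can end up unwrapped, here is a minimal sketch using PyTorch's `size_based_auto_wrap_policy`. The threshold `MIN_PARAMS`, the helper `would_wrap`, and the two example layers are assumptions made for illustration; the wrapping policies actually used in this PR's tests are not shown in the thread.

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Hypothetical threshold, chosen only for this illustration.
MIN_PARAMS = 1000

# Bind the threshold so the policy can be passed around as a single callable.
policy = functools.partial(size_based_auto_wrap_policy, min_num_params=MIN_PARAMS)


def would_wrap(layer: nn.Module) -> bool:
    """Ask the policy whether `layer` would get its own FSDP wrapper."""
    numel = sum(p.numel() for p in layer.parameters())
    # The second argument (recurse=False) asks "wrap this module?" rather than
    # "descend into its children?". The third argument is the parameter count;
    # it is passed positionally because its name differs across torch releases.
    return bool(policy(layer, False, numel))


large = nn.Linear(32, 32)  # 32 * 32 + 32 = 1056 parameters, above the threshold
small = nn.Linear(32, 2)   # 32 * 2 + 2 = 66 parameters, below the threshold

print(would_wrap(large), would_wrap(small))
```

Under a policy like this, `small` stays outside any nested FSDP wrapper, which is the kind of layer the comment above describes as affected.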

SpirinEgor · 2023-04-14T15:40:40Z

@awaelchli I implemented what we discussed. For now, FSDP always aggregates the full state dict on rank zero. CPU offload depends on the CPUOffload setting of the initialized strategy. I'm not sure if this is okay, but it was enough to pass the tests 😅 Setting offload_to_cpu=True in FullStateDictConfig leads to errors on CI, but not in my environment (details).
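For readers unfamiliar with the PyTorch API mentioned here, the following is a minimal sketch of gathering a full state dict on rank zero through `FSDP.state_dict_type`. The helper names `rank_zero_full_state_dict_config` and `gather_full_state_dict` are hypothetical, and this is not necessarily how the merged strategy code is written.

```python
from torch.distributed.fsdp import FullStateDictConfig, StateDictType
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def rank_zero_full_state_dict_config(offload_to_cpu: bool = True) -> FullStateDictConfig:
    """Build a config that materializes the full state dict on rank zero only."""
    return FullStateDictConfig(offload_to_cpu=offload_to_cpu, rank0_only=True)


def gather_full_state_dict(fsdp_model):
    """Collect the unsharded state dict of an FSDP-wrapped model.

    Assumes an initialized process group and that every rank calls this
    function, because gathering the shards is a collective operation.
    With rank0_only=True, ranks other than zero receive an empty dict.
    """
    config = rank_zero_full_state_dict_config()
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, config):
        return fsdp_model.state_dict()
```

In a training script, only rank zero would then write the returned dict to disk, for example with `torch.save`.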

SpirinEgor · 2023-04-17T14:18:11Z

I rethought the logic for offloading to the CPU. It's not good to reuse this variable, as it's intended for a completely different purpose. We need to figure out why this doesn't work on CI, because it works on my setup (2xA100).

awaelchli · 2023-04-17T20:39:35Z

I'm looking into it!

Add additional wrapping to handle sharded model

2752e9d

SpirinEgor requested review from williamFalcon, awaelchli, carmocca and justusschock as code owners January 30, 2023 13:12

github-actions bot added the pl label Jan 30, 2023

SpirinEgor and others added 6 commits January 30, 2023 17:12

Merge branch 'master' into master

d3ad5e7

[pre-commit.ci] auto fixes from pre-commit.com hooks

f070509

for more information, see https://pre-commit.ci

Use tuple to define yield type in Iterator

191f876

Merge remote-tracking branch 'origin/master'

0b8a680

Use tuple to define yield type in Iterator

f5f2158

[pre-commit.ci] auto fixes from pre-commit.com hooks

79431b7

carmocca reviewed Jan 30, 2023

src/pytorch_lightning/strategies/fsdp.py (outdated, resolved)

carmocca added this to the v1.9.x milestone Jan 30, 2023

carmocca added the bug and strategy: fsdp labels Jan 30, 2023

SpirinEgor and others added 2 commits January 31, 2023 14:36

Add tests for checking state_dict extraction

7dbf02d

[pre-commit.ci] auto fixes from pre-commit.com hooks

a6d80e2

SpirinEgor requested a review from carmocca January 31, 2023 14:39

Merge branch 'master' into master

53bef88

mergify bot added the has conflicts label Feb 1, 2023

Merge remote-tracking branch 'upstream/master'

b39b5e0

mergify bot removed the has conflicts label Feb 2, 2023

[pre-commit.ci] auto fixes from pre-commit.com hooks

d775d46

pre-commit-ci bot requested review from tchaton, lantiga, hhsecond and ethanwharris as code owners February 2, 2023 07:52

mergify bot removed the has conflicts label Apr 13, 2023

Merge branch 'master' into master

a09ee14

github-actions bot removed the ci, app (removed), and fabric labels Apr 13, 2023

SpirinEgor and others added 6 commits April 13, 2023 21:27

Revert unnecessary changes in tests

f763c09

Always offload checkpoint to CPU

ed82d00

Validate state_dict only on zero rank

bbd684b

Offload to CPU only if trainer uses offload

8e4804e

Merge branch 'master' into master

14d3d8b

Merge branch 'master' into master

26e9ea7

SpirinEgor requested a review from awaelchli April 14, 2023 15:35

SpirinEgor and others added 2 commits April 17, 2023 16:54

Merge branch 'master' into master

d998776

Always offload checkpoint to CPU

0c1568a

awaelchli and others added 7 commits April 17, 2023 17:08

update tests

7a1b40f

move function to bottom

6b107bc

[pre-commit.ci] auto fixes from pre-commit.com hooks

638a374

update documentation

62441e9

Merge branch 'master' of github.com:SpirinEgor/lightning into SpirinEgor/master

2fc3834

[pre-commit.ci] auto fixes from pre-commit.com hooks

99c0f39

add changelog

9d16758

awaelchli approved these changes Apr 17, 2023

awaelchli added the community label Apr 17, 2023

Borda approved these changes Apr 17, 2023

Borda merged commit bb4e495 into Lightning-AI:master Apr 17, 2023

mergify bot added the ready label Apr 17, 2023

carmocca mentioned this pull request May 10, 2023

FSDP error when load from state_dict #17566

Closed
