Improved model initialization API for Fabric #17462
Conversation
Great job. Since you might prefer to address some of my comments in a follow-up, I'm approving to unblock.
We should also add this to the Fabric docs.
model = MyModel()

@contextmanager
def init_module(self) -> Generator:
    """Instantiate the model and its parameters under this context manager to reduce peak memory usage.
This might also be used in the future for things that aren't modules; maybe we should choose a different name that doesn't explicitly say "module".
`init_module` is fine IMO, but if we want to discuss names :-) I find that the name doesn't convey what is different from just initializing outside the context manager. Maybe `direct_init` would be an alternative? I'm good with `init_module` though, it probably sounds less cryptic.
I was thinking mentioning the "module" in the name is safer because it expresses more precisely what we should do under this context manager. This hopefully helps prevent users from accidentally doing weird stuff under this context manager and getting in trouble, but of course, we can never completely prevent that.
I would vote for something like `efficient_init` or similar.
For example, this context manager would be useful for non-modules too:

import torch
from lightning import Fabric

fabric = Fabric(accelerator="cuda", precision="64-true")
with fabric.efficient_init():
    x = torch.zeros(1)
print(x.device)  # cuda
print(x.dtype)   # torch.float64
It is very important that this context manager be used only for model initialization. For everything else, the user should use `fabric.device`.
The main motivation for this is to load and shard large models efficiently, or to provide a convenient way to cast the model to the desired dtype for inference without code changes.
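To make the motivation concrete, here is a minimal sketch of what such a context manager could do under the hood. This is an illustration only, not the actual Fabric implementation (which also handles strategies and sharding); it assumes PyTorch >= 2.0 for the `torch.device` context manager:

```python
import torch
from contextlib import contextmanager


@contextmanager
def init_module(device, dtype):
    """Create parameters directly on `device` with `dtype` as the default.

    Hypothetical stand-in for the Fabric API discussed above.
    """
    previous = torch.get_default_dtype()
    torch.set_default_dtype(dtype)  # new floating-point tensors use `dtype`
    try:
        with torch.device(device):  # torch >= 2.0: new tensors land on `device`
            yield
    finally:
        torch.set_default_dtype(previous)  # restore the global default


# The layer is allocated once, on the target device and dtype, instead of
# being created in float32 on CPU and then copied with `.to(...)`.
with init_module("cpu", torch.float64):
    layer = torch.nn.Linear(4, 4)
print(layer.weight.dtype)  # torch.float64
```

This avoids the usual peak-memory spike where the model briefly exists twice (the initial float32 copy plus the converted one).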
Looks great
with self.sharded_model(), _replace_dunder_methods(DataLoader, "dataset"), _replace_dunder_methods(
    BatchSampler
):
with _old_sharded_model_context(self._strategy), _replace_dunder_methods(
There wouldn't be any impact for current Fabric end-users (not developers), right? If so, I agree we should remove it.
Codecov Report
Additional details and impacted files:

@@           Coverage Diff            @@
##           master   #17462    +/-  ##
=========================================
- Coverage      83%      59%     -24%
=========================================
  Files         415      410       -5
  Lines       31649    31594      -55
=========================================
- Hits        26405    18708    -7697
- Misses       5244    12886    +7642
What does this PR do?
Adds a new context manager in Fabric that allows you to init your model (e.g. with `FakeTensorMode` for FSDP, #16448). This context manager generalizes the previous `sharded_model` manager from the LightningLite days (we don't need both).
Example 1: Init the model on the GPU instantly, with weights in half precision (your model may not fit in float32). See #17287 for half support.
Example 2: Init an FSDP model on the meta device (or using torchdistx).
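As a rough sketch of what meta-device initialization looks like in plain PyTorch (assuming torch >= 2.0 for the `torch.device` context manager; the FSDP/torchdistx wiring is omitted):

```python
import torch

# Parameters created under the meta device carry only shape/dtype metadata,
# so even a very large layer allocates no real memory here. FSDP can later
# materialize each shard on its own rank.
with torch.device("meta"):
    layer = torch.nn.Linear(4096, 4096)

print(layer.weight.is_meta)  # True
```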
Example 3:
(Future) Init the model with empty weights explicitly (no memory allocated) if you need to overwrite them by loading a checkpoint later anyway.
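A sketch of this "empty init, then load" workflow in plain PyTorch (assuming torch >= 2.0; the checkpoint dict below is a stand-in for a real checkpoint file):

```python
import torch

# Build the module on the meta device: no storage is allocated yet.
with torch.device("meta"):
    model = torch.nn.Linear(8, 8)

# Materialize uninitialized storage on the target device...
model = model.to_empty(device="cpu")

# ...then overwrite it from a checkpoint, skipping the wasted random init.
checkpoint = {"weight": torch.ones(8, 8), "bias": torch.zeros(8)}
model.load_state_dict(checkpoint)
print(model.weight.sum().item())  # 64.0
```

The point of the proposed API is to do this without the user touching the meta device directly.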
This API can be used for #16448 or to support torchdistx.
cc @Borda @carmocca @justusschock @awaelchli