Speed up loading blocks using init with meta weights #285
Merged
Conversation
borzunov reviewed Mar 12, 2023
borzunov reviewed Mar 12, 2023
borzunov (Collaborator)

Awesome results! The total block loading time (including the time to move it to GPU) went down from 57 sec to 25 sec on my machines. This means that, given that the blocks are already downloaded, the servers will spend 2x less time restarting after being preempted or during rebalancing.
Co-authored-by: Alexander Borzunov <[email protected]>
borzunov approved these changes Mar 12, 2023
justheuristic approved these changes Mar 12, 2023
borzunov added a commit that referenced this pull request Apr 25, 2023
- After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()`
- In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of the dtype of the weights (huggingface/accelerate#920)
- Because of that, blocks and attention caches used float32, which caused OOMs
- This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)
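The `torch_dtype="auto"` behavior described above can be sketched as follows. This is a minimal illustration, not Petals' actual code: `resolve_dtype` is a hypothetical helper, and the config dtype is assumed to be the string stored in `config.json` (e.g. `"float16"`).

```python
# Hypothetical sketch of resolving torch_dtype="auto" from a model config.
# resolve_dtype is an illustrative helper, not part of Petals or accelerate.
import torch

def resolve_dtype(torch_dtype, config_torch_dtype: str) -> torch.dtype:
    if torch_dtype == "auto":
        # "auto" means: use the dtype recorded in the model's config.json,
        # so weights keep their original precision instead of defaulting
        # to the module's current dtype (float32 in the OOM case above).
        return getattr(torch, config_torch_dtype)
    return torch_dtype

# An explicit dtype overrides the config; "auto" defers to it.
assert resolve_dtype("auto", "float16") is torch.float16
assert resolve_dtype(torch.bfloat16, "float16") is torch.bfloat16
```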
This PR changes the block initialization logic to use PyTorch meta tensors before assigning values from the state_dict. This avoids unnecessary memory allocation and default parameter initialization, which can take a lot of time (and RAM) for large models.
Before this PR: 1 block loaded in ~16 seconds
After: 1 block loaded in ~2 seconds
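The meta-tensor technique can be sketched in plain PyTorch. This is an illustrative example, not Petals' actual `load_pretrained_block()`: a small `nn.Linear` stands in for a transformer block, and `load_block_with_meta_init` is a hypothetical helper.

```python
# Hypothetical sketch of meta-tensor loading: build the module without
# allocating or initializing memory, then copy in pretrained weights.
import torch
import torch.nn as nn

def load_block_with_meta_init(state_dict: dict) -> nn.Module:
    # 1. Construct the module on the "meta" device: shapes and dtypes are
    #    recorded, but no storage is allocated and no random init runs.
    with torch.device("meta"):
        block = nn.Linear(4, 4)  # stands in for a large transformer block

    # 2. Materialize empty (uninitialized) storage on the target device.
    block = block.to_empty(device="cpu")

    # 3. Copy the pretrained weights in, skipping default init entirely.
    block.load_state_dict(state_dict)
    return block

# Usage: the weights come from a normally initialized reference module.
reference = nn.Linear(4, 4)
loaded = load_block_with_meta_init(reference.state_dict())
assert torch.equal(loaded.weight, reference.weight)
```

The speedup comes from step 1: a conventional constructor both allocates the full parameter tensors and runs their random initializers, and both costs are wasted when a state_dict immediately overwrites everything.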