Migrate distributed state dict API #2138
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2138
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3d0d26f with merge base 002b17c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

Coverage diff vs `main` (#2138):

|          | main  | #2138  | +/-     |
|----------|-------|--------|---------|
| Coverage | 9.33% | 65.26% | +55.93% |
| Files    | 289   | 334    | +45     |
| Lines    | 16959 | 19192  | +2233   |
| Hits     | 1583  | 12526  | +10943  |
| Misses   | 15376 | 6666   | -8710   |

☔ View full report in Codecov by Sentry.
…ept 2 device type and optimize memory (#142845)

For the distributed state dict API [migration](pytorch/torchtune#2138), this makes the following changes:

1. `load_from_full_model_state_dict` in torchtune calls `set_model_state_dict` with an option controlling cpu_offload. Add cpu_offload to `_load_model_state_dict` so tensors are processed on CPU when the config is True.
2. Relax the device check: lora_finetune may use 2 device types, so accept that as valid.
3. Optimize memory performance:
   - use `.detach().clone()` instead of a view directly
   - if local_state is not meta, copy `full_tensor[slices]` into `ret.to_local()`
4. Add the corresponding unit tests.

Memory performance calling from torchtune with llama2/7B_full:

1. cpu_offload = True: [memory trace screenshot](https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b)
2. cpu_offload = False: [memory trace screenshot](https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806)

Pull Request resolved: #142845
Approved by: https://github.com/fegin
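A minimal sketch of what item 1 above could look like on the torchtune side, assuming a simplified signature (the real `load_from_full_model_state_dict` takes more arguments; this is illustrative, not the actual implementation):

```python
import torch.nn as nn
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)


def load_from_full_model_state_dict(
    model: nn.Module,
    full_sd: dict,
    cpu_offload: bool = False,
):
    # full_state_dict=True: `full_sd` holds unsharded tensors.
    # cpu_offload=True: copies are staged through CPU, which is the
    # behavior item 1 enables inside _load_model_state_dict.
    options = StateDictOptions(
        full_state_dict=True,
        cpu_offload=cpu_offload,
        strict=True,
    )
    return set_model_state_dict(model, model_state_dict=full_sd, options=options)
```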
sharded_param = full_tensor.new_zeros(chunk.size())
sharded_param[: chunk.size(0)].copy_(chunk)

# TODO: change to from_local API (need to add view support for NF4)
How can we get view support for NF4?
cc @andrewor14
Thank you for the review. We currently skip the NF4 tensor part and plan to support NF4 next quarter.
Looks like there's already view support for NF4Tensor? What's the error you're getting?
also cc @drisspg @weifengpy
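For context, a minimal sketch of the `from_local` path the TODO refers to, for regular (non-NF4) tensors. `chunk`, `full_tensor`, and `device_mesh` are assumed to be in scope as in the surrounding code, and the `Shard(0)` placement is an assumption about the sharding layout, not taken from the PR:

```python
from torch.distributed.tensor import DTensor, Shard  # torch.distributed._tensor on older releases

# Current approach from the diff: copy the local chunk into a padded buffer.
sharded_param = full_tensor.new_zeros(chunk.size())
sharded_param[: chunk.size(0)].copy_(chunk)

# The from_local alternative wraps the local shard as a DTensor directly;
# doing the same for NF4 would need the view support discussed above.
dtensor = DTensor.from_local(
    sharded_param,
    device_mesh,            # mesh of the FSDP ranks (assumed in scope)
    placements=[Shard(0)],  # dim-0 sharding, assumed to match the FSDP layout
)
```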
Context
What is the purpose of this PR?
Migrate to the distributed state dict APIs from torch.distributed.
Changelog
What are the changes made in this PR?

Switch to the distributed state dict APIs from torch.distributed (see the sketch after this list):

- `load_from_full_model_state_dict` <- `set_model_state_dict`
- `gather_cpu_state_dict` <- `get_model_state_dict`
- `load_from_full_optimizer_state_dict` <- `set_optimizer_state_dict`
- `get_full_optimizer_state_dict` <- `get_optimizer_state_dict`

To align the inputs, add a `model` input to `get_full_optimizer_state_dict` and `load_from_full_optimizer_state_dict`, and change the `sharded_sd` input of `gather_cpu_state_dict` to `model`.

TODO: NF4 tensors are kept on the current path; NF4 support remains future work.
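A hedged sketch of how the remaining wrappers might map onto the torch.distributed.checkpoint.state_dict APIs; the signatures below are simplified assumptions, not the exact torchtune ones:

```python
import torch.nn as nn
import torch.optim as optim
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_model_state_dict,
    get_optimizer_state_dict,
    set_optimizer_state_dict,
)


def gather_cpu_state_dict(model: nn.Module) -> dict:
    # Takes the model itself now (previously a sharded state dict) and lets
    # get_model_state_dict handle the gather and CPU offload.
    opts = StateDictOptions(full_state_dict=True, cpu_offload=True)
    return get_model_state_dict(model, options=opts)


def get_full_optimizer_state_dict(model: nn.Module, opt: optim.Optimizer) -> dict:
    # `model` is the newly added input mentioned in the changelog.
    opts = StateDictOptions(full_state_dict=True, cpu_offload=True)
    return get_optimizer_state_dict(model, opt, options=opts)


def load_from_full_optimizer_state_dict(
    model: nn.Module, opt: optim.Optimizer, full_sd: dict
) -> None:
    opts = StateDictOptions(full_state_dict=True)
    set_optimizer_state_dict(model, opt, optim_state_dict=full_sd, options=opts)
```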
nf4tensor are kept the same, remain as future work
Test plan
- `pytest tests/torchtune/training/test_distributed.py`
- `pytest tests -m integration_test`

We compared runs with the previous API and the new API: the losses are identical both on initial loading and when resuming from checkpoint.
We also captured memory traces (see the sketch below); the results show that the new API does not increase peak memory compared with the current one.
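One way the memory traces could be captured, assuming PyTorch's built-in CUDA memory snapshot tooling (the actual runs may have used a different mechanism; the file name is illustrative):

```python
import torch

# Start recording allocator events (with stack traces) before the run.
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run one training step with the old API, then again with the new API ...

# Dump a snapshot that can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("state_dict_migration_memory.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```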