[Storage Cleaner] Speed up unsharding of some legacy checkpoints #488
Conversation
Very cool! Just a couple questions
olmo/checkpoint.py (outdated)

if rank_size == 0:
    return

temp: np.ndarray = torch.zeros(rank_size, dtype=shard0_md.tensor_properties.dtype).numpy()
What's the purpose of this temp array? If it's just to get the number of bytes, you should be able to infer that from the data type and size.
Just the type and number of bytes. I've changed the code to assume fp32 (c58f4b4). I already know they are not bf16, at least, because .numpy() fails for bf16.
Yeah, when we train in bf16 (or fp16), the main copy of the model weights is always fp32.
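For illustration, a minimal sketch of the approach discussed above: derive the byte count directly from the assumed fp32 dtype instead of materializing a temporary array. The helper name and rank_size parameter are placeholders mirroring the snippet above, not the exact code from the PR.

import numpy as np

def rank_num_bytes(rank_size: int) -> int:
    # Assume fp32: the main copy of the weights is fp32, and bf16 tensors
    # would fail torch's .numpy() anyway.
    return rank_size * np.dtype(np.float32).itemsize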
olmo/checkpoint.py (outdated)

temp: np.ndarray = torch.zeros(1, dtype=shard0_md.tensor_properties.dtype).numpy()
numpy_type = temp.dtype
Similar question here, but looks like it's just to get the data type? It's probably reasonable to assume FP32.
Changed to assume fp32 (c58f4b4).
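Similarly, a sketch of what the fixed-dtype version of this snippet could look like, assuming fp32; this is not necessarily the exact change in c58f4b4.

import numpy as np

# State the numpy dtype directly instead of round-tripping a one-element
# tensor through torch just to read its dtype.
numpy_type = np.dtype(np.float32)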
LGTM
This PR changes the unsharding mechanism for legacy checkpoints to use processes and shared memory instead of threads. In one case where the world size was 1024, this implementation brought the unsharding time down from 6 hours to 30 minutes. This implementation is slower than the old one at smaller scales, but that is acceptable.
One option was to keep the old mechanism around in the code as well, but since we are trying to get rid of legacy sharded checkpoints, it doesn't seem worth keeping that code around.
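For context, a rough sketch of the processes-plus-shared-memory pattern described above. This is not the actual OLMo implementation: load_rank_shard, unshard, and the sizes used are placeholders. Each worker process attaches to one shared fp32 buffer and writes its rank's slice into place.

import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory


def load_rank_shard(rank: int) -> tuple[int, np.ndarray]:
    # Placeholder: the real storage cleaner would read this rank's legacy
    # sharded checkpoint here; this stub just fabricates a small fp32 slice.
    shard = np.full(4, float(rank), dtype=np.float32)
    return rank * 4, shard


def _unshard_rank(shm_name: str, total_numel: int, rank: int) -> None:
    # Worker process: attach to the shared buffer and copy this rank's slice
    # of the flat fp32 parameters into place.
    shm = SharedMemory(name=shm_name)
    flat = np.ndarray((total_numel,), dtype=np.float32, buffer=shm.buf)
    offset, shard = load_rank_shard(rank)
    flat[offset : offset + shard.size] = shard
    del flat  # drop the buffer export so close() succeeds
    shm.close()


def unshard(world_size: int, total_numel: int) -> np.ndarray:
    shm = SharedMemory(create=True, size=total_numel * np.dtype(np.float32).itemsize)
    try:
        procs = [
            Process(target=_unshard_rank, args=(shm.name, total_numel, rank))
            for rank in range(world_size)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        view = np.ndarray((total_numel,), dtype=np.float32, buffer=shm.buf)
        result = view.copy()  # copy out before releasing the shared block
        del view
    finally:
        shm.close()
        shm.unlink()
    return result


if __name__ == "__main__":
    print(unshard(world_size=4, total_numel=16))

The point of the pattern is that the CPU-bound copy work runs in separate processes rather than GIL-bound threads, while the shared block avoids giving every worker its own copy of the full parameter buffer.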