tjruwase commented Sep 6, 2021

Tools for converting checkpoints.

tjruwase requested a review from ShadenSmith on September 6, 2021

tjruwase commented Sep 6, 2021

@stas00 FYI

stas00 commented Sep 10, 2021

OK, to complete the conversion and make the model usable with "--finetune" the missing bits are:

sd["args"].tensor_model_parallel_size = 1
sd["args"].pipeline_model_parallel_size = 1
sd["args"].consumed_train_samples = 0
sd["args"].consumed_valid_samples = 0

The first 2 of course need to be adjusted to the target tp/pp sizes.

The last 2 need to be reset; otherwise Megatron tries to resume training from some very high sample count, when it should be 0.
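e.g. something along these lines (untested sketch; the path is illustrative, not what the scripts produce):

import torch

path = "iter_0000001/mp_rank_00/model_optim_rng.pt"  # illustrative location of the converted checkpoint
sd = torch.load(path, map_location="cpu")

sd["args"].tensor_model_parallel_size = 1    # adjust to the target TP size
sd["args"].pipeline_model_parallel_size = 1  # adjust to the target PP size
sd["args"].consumed_train_samples = 0        # reset so Megatron doesn't try to resume
sd["args"].consumed_valid_samples = 0

torch.save(sd, path)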


I haven't quite figured out how to solve padded_vocab_size being larger than the vocab. The embeddings probably need to be truncated to the vocab size before saving.

The workaround is to use the actual padded vocab size when finetuning, i.e.:

--make-vocab-size-divisible-by 50688 

for when the padded vocab ends up being 50688.

stas00 commented Sep 10, 2021

Then there is the file layout. Meg-LM clearly expects this layout:

iter_0000001/mp_rank_00_000/model_optim_rng.pt
latest_checkpointed_iteration.txt

whereas in the Meg-DS tree it wants:

iter_0000001/mp_rank_00/model_optim_rng.pt
latest_checkpointed_iteration.txt

No _000 at the end. I wonder if just creating one and adding a symlink to the other would do the trick (this is with tp=1/pp=1); at least this is how I'm getting around it while testing with both trees.

The first segment of the path is:

directory = 'iter_{:07d}'.format(iteration)

in the meg code.

Additionally, we could probably convert global_step37876 to iter_0037876 to help the user know which iteration the training is coming from, rather than iter=1. Incidentally, you can then save it in the dict as iteration=37876, which it wants anyway for non-finetune.
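Roughly, something like this could cover both points (untested sketch, made-up paths, just to illustrate):

import os
import re

ds_name = "global_step37876"                      # example DeepSpeed checkpoint dir
iteration = int(re.match(r"global_step(\d+)", ds_name).group(1))
meg_dir = "iter_{:07d}".format(iteration)         # same format string as the meg code
os.makedirs(meg_dir, exist_ok=True)               # just for this illustration

# Meg-LM looks for mp_rank_00_000, Meg-DS for mp_rank_00 (tp=1/pp=1);
# a relative symlink inside the iteration dir lets one checkpoint serve both trees.
os.symlink("mp_rank_00", os.path.join(meg_dir, "mp_rank_00_000"))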

tjruwase commented:

> The first 2 of course need to be adjusted to the target tp/pp sizes.
>
> The last 2 need to be reset; otherwise Megatron tries to resume training from some very high sample count, when it should be 0.

Thanks for the feedback. I can update deepspeed_to_megatron.py for the first 2 to fix an inconsistency in the checkpoint state. However, I am unsure where to handle the last 2 since it would prevent the converted checkpoint from being used for continued training. So perhaps the finetuning script should handle the last 2. What do you think?

stas00 commented Sep 10, 2021

But this checkpoint can't be loaded for continued training at the moment; e.g., it lacks the iteration entry, and Megatron crashes without --finetune because of that.

I'm not sure how you'd change the finetune script to ignore consumed_train_samples, because it also checkpoints and should be able to resume from its own ongoing consumed_train_samples record.

Perhaps we have 2 different modes here:

  1. reshape the checkpoint, but presume continued training
  2. convert for a release purpose, resetting some counters (see the sketch below).
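For mode 2, the converter could perhaps grow a flag, e.g. (purely hypothetical sketch; the --for-release name is just a placeholder, not an existing option):

import argparse

parser = argparse.ArgumentParser(description="DeepSpeed to Megatron checkpoint converter")
parser.add_argument("--for-release", action="store_true",
                    help="reset consumed_train_samples / consumed_valid_samples so the "
                         "checkpoint suits inference or finetuning on a new dataset")
args = parser.parse_args()

if args.for_release:
    # mode 2: zero the counters, as discussed above
    pass
else:
    # mode 1: keep the counters, presuming continued training
    pass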

stas00 commented Sep 10, 2021

Found one more culprit: remember how Jared's script was asking for a Megatron clone path when doing the conversion? That proved to be essential, since if we use the default Meg-DS when converting, torch.load then fails when attempting to use Meg-LM:

Traceback (most recent call last):
  File "pretrain_gpt.py", line 124, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/mnt/nvme1/code/huggingface/Megatron-LM/megatron/training.py", line 112, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/mnt/nvme1/code/huggingface/Megatron-LM/megatron/training.py", line 325, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/mnt/nvme1/code/huggingface/Megatron-LM/megatron/checkpointing.py", line 314, in load_checkpoint
    state_dict = torch.load(checkpoint_name, map_location='cpu')
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/serialization.py", line 875, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'megatron.enums'

Things have changed in Meg-DS, and now it can't find megatron.enums. In the bigscience Meg-LM tree this is megatron.model.enums instead.

Grr, this appears really tricky now that the codebases are starting to diverge.

I can't even load the bigscience Meg-DS checkpoint using the Meg-DS codebase in PYTHONPATH.

OK solved this by adding both clones to PYTHONPATH explicitly:

PYTHONPATH=/hf/Megatron-DeepSpeed-master:/hf/Megatron-DeepSpeed-microsoft python tools/convert_checkpoint/deepspeed_to_megatron.py ...

But it still doesn't work when I then try to train with the Megatron-LM tree.

I'm asking to have that restored:

bigscience-workshop/Megatron-DeepSpeed#7 (comment)

Unless you have some bright ideas on how not to pickle structures that may be missing in the target?
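One speculative, untested idea: alias the missing module path onto the module that does exist in the current tree before calling torch.load, assuming the enum definitions are compatible between the two codebases:

import sys
import torch
import megatron.model.enums  # location in the bigscience tree, per the note above

sys.modules["megatron.enums"] = megatron.model.enums  # the path baked into the pickle
sd = torch.load("iter_0000001/mp_rank_00/model_optim_rng.pt", map_location="cpu")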

tjruwase commented Sep 10, 2021

> Additionally, we could probably convert global_step37876 to iter_0037876 to help the user know which iteration the training is coming from, rather than iter=1. Incidentally, you can then save it in the dict as iteration=37876, which it wants anyway for non-finetune.

I am a bit confused by this. Are you seeing iter=1 in the converted checkpoint?

stas00 commented Sep 10, 2021

I think it perhaps expects the top-level iteration key and not args.iteration? Remember how I mentioned there were 3 keys it expects? So I think it wants:

ITERATION_KEY = 'iteration'
...
    checkpoint_sd[ITERATION_KEY] = iteration

in your code.

> Are you seeing iter=1 in the converted checkpoint?

I see args.iteration=94743, but the checkpoint sub-dir is global_step97500, so it's odd. It must be a lack of sync between Meg and Meg-DS.

tjruwase commented:

> I think it perhaps expects the top-level iteration key and not args.iteration? Remember how I mentioned there were 3 keys it expects? So I think it wants:

It would be great to get some clarity on which iteration to use here.

Also, should it be

  1. checkpoint_sd['args'][ITERATION_KEY], or
  2. checkpoint_sd[ITERATION_KEY]?

stas00 commented Sep 10, 2021

> > I think it perhaps expects the top-level iteration key and not args.iteration? Remember how I mentioned there were 3 keys it expects? So I think it wants:
>
> It would be great to get some clarity on which iteration to use here.

The problem is that Meg-DS doesn't seem to update Meg's native iteration variable soon enough, so the saved checkpoint reports the outdated, lower iteration. Most likely the fix is needed inside Meg-DS, to at the very least sync Meg's native iteration variable with the Meg-DS value of the same.

> Also, should it be
>
>   1. checkpoint_sd['args'][ITERATION_KEY], or
>   2. checkpoint_sd[ITERATION_KEY]?

The latter.

But it seems to be pointless, because once I add the missing key, it then fails with:

 loading checkpoint from /hf/Megatron-DeepSpeed-master/data/1B3-PP4-TP4-Meg at iteration 1
 checkpoint version 3.0
Unable to load optimizer from checkpoint /hf/Megatron-DeepSpeed-master/data/1B3-PP4-TP4-Meg/iter_0000001/mp_rank_00/model_optim_rng.pt. Specify --no-load-optim or --finetune to prevent attempting to load the optimizer state, exiting ...

We can't manifest optimizer states for Meg-LM out of nowhere, so it appears that after the conversion only inference or finetuning is possible. In that case it's probably pointless to try to keep consumed_train_samples and its friend.

Perhaps, for now, let's handle just the clear case of inference/finetuning, with the assumption that finetuning will require a different dataset?

I think it's only when we reshape the checkpoint to support changing the degree of TP (as discussed a few days later) that we would try to preserve everything, but that's when saving from Meg-DS back to Meg-DS.

stas00 commented Sep 10, 2021

So I think the only remaining thing to address (other than embeddings) is: #14 (comment)

And let's set:

checkpoint_sd[ITERATION_KEY]=xxx

to whatever global_stepxxx points to, and match iter_0000xxx to it.

Sorry, brainstorming here... but then what if the input checkpoint isn't named /path/to/global_stepxxx?
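Maybe parse it with a fallback, e.g. (untested sketch; the fallback choice is just a guess on my part):

import os
import re

def infer_iteration(checkpoint_dir: str, state_dict: dict) -> int:
    m = re.match(r"global_step(\d+)$", os.path.basename(os.path.normpath(checkpoint_dir)))
    if m:
        return int(m.group(1))
    return state_dict.get("iteration", 0)  # fall back to whatever the checkpoint recorded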

stas00 commented Sep 10, 2021

In another checkpoint I was given, the discrepancy is 2x:

the Meg-DS file is global_step37876, but args.iteration=18931, which is ~0.5x the former. Weird. It looks like the 2 counters aren't in sync at all.

This tells me that args.iteration=18931 is incorrect and somehow the DS integration forgets to update it, and that global_step37876 is the correct iteration since it matches the log files.

stas00 commented Sep 11, 2021

OK, I have figured this one out. You were getting the wrong iteration; you need this one:

f = "global_step37876/mp_rank_00_model_states.pt"
sd = torch.load(f)
sd["iteration"]

This is the real iteration; args.iteration in the original code is whatever the iteration was at the start of training. Does that make sense?

sd["iteration"] # right
sd["args"].iteration # wrong

stas00 commented Sep 14, 2021

@tjruwase, 2 more things that I discovered are different from the checkpoint generated by Meg-LM natively. So these need to be changed as demonstrated:

- embeddings["word_embeddings.weight"]
+ embeddings["word_embeddings"]["weight"]

and:

+ embeddings["position_embeddings.weight"]
- embeddings["position_embeddings"]["weight"]

So I think after this fix the resulting checkpoint will match the native one.

Plus the resulting file structure: #14 (comment)

And then it's good to be merged.

I did multiple tests on the final stage meg2hf and the conversion appears to be correct.

stas00 commented Sep 18, 2021

This is good to merge, @tjruwase! Thank you!

stas00 commented Sep 20, 2021

@tjruwase, I made a small change to your work to separate the creation of the checkpoint from saving it, so that I could re-use it to create the HF transformers checkpoint on the fly.

Also made the PP/TP sizes 1 by default, since in the HF case that's always what they are for now.

If it looks acceptable to you, perhaps let's merge this back into your master tree? If so, please cherry-pick these 2:

  1. bigscience-workshop/Megatron-DeepSpeed@bea5ded
  2. bigscience-workshop/Megatron-DeepSpeed@d6c2a80

Thank you!

stas00 commented Sep 22, 2021

Here you go; I added 2 more commits where I made the scripts executable ;)

git checkout -b convert-meg-ds-to-hf
git remote add other https://github.com/bigscience-workshop/Megatron-DeepSpeed
git fetch other
git cherry-pick bea5ded
git cherry-pick d6c2a80
git cherry-pick 26f18b5
git cherry-pick 2f662e8
git push --set-upstream origin convert-meg-ds-to-hf
