
Distributed checkpointing with mcore GPT #7116

Merged
merged 121 commits into main from mcore_gpt_dist_ckpt
Aug 28, 2023

Conversation

ericharper (Collaborator) commented Jul 27, 2023

This PR needs the mcore distributed checkpointing for GPT PR to be pushed before it can be merged.

What does this PR do?

Adds distributed checkpointing when using mcore GPT.

Distributed checkpointing enables training runs to restart automatically with a different model-parallel configuration.
The checkpoint is saved to disk according to the sharded_state_dict:

Below is a sample of what the checkpoint looks like on disk.

common.pt                                                     model.decoder.layers.self_attention.linear_qkv.weight                           optimizer.state.exp_avg.model.embedding.word_embeddings.weight                     optimizer.state.fp32_from_fp16.model.decoder.final_layernorm.bias
metadata.json                                                 model.embedding.position_embeddings.weight                                      optimizer.state.exp_avg.model.output_layer.weight                                  optimizer.state.fp32_from_fp16.model.decoder.final_layernorm.weight
model.decoder.final_layernorm.bias                            model.embedding.word_embeddings.weight                                          optimizer.state.exp_avg_sq.model.decoder.final_layernorm.bias                      optimizer.state.fp32_from_fp16.model.decoder.layers.input_layernorm.bias
model.decoder.final_layernorm.weight                          model.output_layer.weight                                                       optimizer.state.exp_avg_sq.model.decoder.final_layernorm.weight                    optimizer.state.fp32_from_fp16.model.decoder.layers.input_layernorm.weight
model.decoder.layers.input_layernorm.bias                     optimizer.state.exp_avg.model.decoder.final_layernorm.bias                      optimizer.state.exp_avg_sq.model.decoder.layers.input_layernorm.bias               optimizer.state.fp32_from_fp16.model.decoder.layers.mlp.linear_fc1.bias
model.decoder.layers.input_layernorm.weight                   optimizer.state.exp_avg.model.decoder.final_layernorm.weight                    optimizer.state.exp_avg_sq.model.decoder.layers.input_layernorm.weight             optimizer.state.fp32_from_fp16.model.decoder.layers.mlp.linear_fc1.weight
model.decoder.layers.mlp.linear_fc1.bias                      optimizer.state.exp_avg.model.decoder.layers.input_layernorm.bias               optimizer.state.exp_avg_sq.model.decoder.layers.mlp.linear_fc1.bias                optimizer.state.fp32_from_fp16.model.decoder.layers.mlp.linear_fc2.bias
model.decoder.layers.mlp.linear_fc1._extra_state              optimizer.state.exp_avg.model.decoder.layers.input_layernorm.weight             optimizer.state.exp_avg_sq.model.decoder.layers.mlp.linear_fc1.weight              optimizer.state.fp32_from_fp16.model.decoder.layers.mlp.linear_fc2.weight
model.decoder.layers.mlp.linear_fc1.weight                    optimizer.state.exp_avg.model.decoder.layers.mlp.linear_fc1.bias                optimizer.state.exp_avg_sq.model.decoder.layers.mlp.linear_fc2.bias                optimizer.state.fp32_from_fp16.model.decoder.layers.post_self_attn_layernorm.bias
model.decoder.layers.mlp.linear_fc2.bias                      optimizer.state.exp_avg.model.decoder.layers.mlp.linear_fc1.weight              optimizer.state.exp_avg_sq.model.decoder.layers.mlp.linear_fc2.weight              optimizer.state.fp32_from_fp16.model.decoder.layers.post_self_attn_layernorm.weight
model.decoder.layers.mlp.linear_fc2._extra_state              optimizer.state.exp_avg.model.decoder.layers.mlp.linear_fc2.bias                optimizer.state.exp_avg_sq.model.decoder.layers.post_self_attn_layernorm.bias      optimizer.state.fp32_from_fp16.model.decoder.layers.self_attention.linear_proj.bias
model.decoder.layers.mlp.linear_fc2.weight                    optimizer.state.exp_avg.model.decoder.layers.mlp.linear_fc2.weight              optimizer.state.exp_avg_sq.model.decoder.layers.post_self_attn_layernorm.weight    optimizer.state.fp32_from_fp16.model.decoder.layers.self_attention.linear_proj.weight
model.decoder.layers.post_self_attn_layernorm.bias            optimizer.state.exp_avg.model.decoder.layers.post_self_attn_layernorm.bias      optimizer.state.exp_avg_sq.model.decoder.layers.self_attention.linear_proj.bias    optimizer.state.fp32_from_fp16.model.decoder.layers.self_attention.linear_qkv.bias
model.decoder.layers.post_self_attn_layernorm.weight          optimizer.state.exp_avg.model.decoder.layers.post_self_attn_layernorm.weight    optimizer.state.exp_avg_sq.model.decoder.layers.self_attention.linear_proj.weight  optimizer.state.fp32_from_fp16.model.decoder.layers.self_attention.linear_qkv.weight
model.decoder.layers.self_attention.linear_proj.bias          optimizer.state.exp_avg.model.decoder.layers.self_attention.linear_proj.bias    optimizer.state.exp_avg_sq.model.decoder.layers.self_attention.linear_qkv.bias     optimizer.state.fp32_from_fp16.model.embedding.position_embeddings.weight
model.decoder.layers.self_attention.linear_proj._extra_state  optimizer.state.exp_avg.model.decoder.layers.self_attention.linear_proj.weight  optimizer.state.exp_avg_sq.model.decoder.layers.self_attention.linear_qkv.weight   optimizer.state.fp32_from_fp16.model.embedding.word_embeddings.weight
model.decoder.layers.self_attention.linear_proj.weight        optimizer.state.exp_avg.model.decoder.layers.self_attention.linear_qkv.bias     optimizer.state.exp_avg_sq.model.embedding.position_embeddings.weight              optimizer.state.fp32_from_fp16.model.output_layer.weight
model.decoder.layers.self_attention.linear_qkv.bias           optimizer.state.exp_avg.model.decoder.layers.self_attention.linear_qkv.weight   optimizer.state.exp_avg_sq.model.embedding.word_embeddings.weight
model.decoder.layers.self_attention.linear_qkv._extra_state   optimizer.state.exp_avg.model.embedding.position_embeddings.weight              optimizer.state.exp_avg_sq.model.output_layer.weight

Then inside a module directory we have the sharded tensor:

ls model.decoder.layers.mlp.linear_fc1.weight/
0.0.0  1.0.0  10.0.0  11.0.0  12.0.0  13.0.0  14.0.0  15.0.0  2.0.0  3.0.0  4.0.0  5.0.0  6.0.0  7.0.0  8.0.0  9.0.0

To implement distributed checkpointing for a model, its sharded_state_dict has to be defined.
This is done in Megatron Core, so in NeMo, if the module comes from mcore, we only have to call module.sharded_state_dict().
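
For reference, here is a minimal sketch (not the NeMo implementation; the function names, the "model." prefix, and the checkpoint_dir argument are illustrative) of how a sharded state dict flows into Megatron Core's dist_checkpointing save/load calls:

import megatron.core.dist_checkpointing as dist_checkpointing


def save_dist_checkpoint(mcore_gpt_model, checkpoint_dir):
    # Each mcore module defines how its weights are sharded across model-parallel
    # ranks, so NeMo only needs to ask the module for its sharded state dict.
    sharded_state_dict = mcore_gpt_model.sharded_state_dict(prefix="model.")

    # Every rank calls save(); each rank writes only the shards it owns, which
    # produces the per-key directories shown in the listing above.
    dist_checkpointing.save(sharded_state_dict, checkpoint_dir)


def load_dist_checkpoint(mcore_gpt_model, checkpoint_dir):
    # Loading is resharding-aware: the same checkpoint can be restored into a
    # different tensor/pipeline-parallel configuration.
    sharded_state_dict = mcore_gpt_model.sharded_state_dict(prefix="model.")
    state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_dir)
    mcore_gpt_model.load_state_dict(state_dict)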

Collection: NLP

Usage

Usage is automatic when using mcore:

model.mcore_gpt=True
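
For illustration, a minimal sketch of setting the flag programmatically (the config is constructed inline here for brevity; in practice mcore_gpt lives in, or is overridden on, the NeMo GPT YAML config):

from omegaconf import OmegaConf

# Illustrative only: stands in for the real NeMo GPT model config.
cfg = OmegaConf.create({"model": {"mcore_gpt": False}})
cfg.model.mcore_gpt = True  # use the mcore GPT model; checkpoints are then saved in the distributed format shown above
print(OmegaConf.to_yaml(cfg))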

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs for various areas.

Additional Information

  • Related to # (issue)

ericharper and others added 30 commits June 7, 2023 12:00
@github-actions github-actions bot removed the CI label Aug 23, 2023
@github-actions github-actions bot added the CI label Aug 24, 2023
@ericharper ericharper marked this pull request as ready for review August 25, 2023 00:49
mikolajblaz and others added 3 commits August 25, 2023 17:56
* Integrate  new DistOpt state dict

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change optimizer fp32_param key

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: eharper <[email protected]>
Jenkinsfile (review thread resolved)
Comment on lines +28 to +32
from megatron.core.dist_checkpointing.optimizer import (
get_param_id_to_sharded_param_map,
make_sharded_optimizer_tensor,
optim_state_to_sharding_state,
)

Code scanning / CodeQL check notice: Unused import
Import of 'make_sharded_optimizer_tensor' is not used.
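
For context, a rough sketch of how two of these helpers are typically combined to shard an optimizer state dict (the wrapper function, variable names, and the "model." prefix are illustrative, not the code under review):

from itertools import chain

from megatron.core.dist_checkpointing.optimizer import (
    get_param_id_to_sharded_param_map,
    optim_state_to_sharding_state,
)


def sharded_optimizer_state_dict(model, optimizer):
    model_sharded_sd = model.sharded_state_dict(prefix="model.")
    optim_sd = optimizer.state_dict()

    # Map the integer param ids used inside optimizer.state_dict() to the
    # ShardedTensors that describe the corresponding model parameters.
    id_to_sharded_param = get_param_id_to_sharded_param_map(
        model_sharded_sd,
        chain.from_iterable(g["params"] for g in optimizer.param_groups),
    )

    # Rewrite optimizer state entries (exp_avg, exp_avg_sq, ...) in place as
    # ShardedTensors so they can be saved next to the model shards.
    optim_state_to_sharding_state(optim_sd, id_to_sharded_param)
    return optim_sd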

@aklife97 aklife97 (Collaborator) left a comment

LGTM, thank you!

@ericharper ericharper merged commit d6357fd into main Aug 28, 2023
15 checks passed
@ericharper ericharper deleted the mcore_gpt_dist_ckpt branch August 28, 2023 21:52
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (NVIDIA#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* update strategy and exp_manager

Signed-off-by: ericharper <[email protected]>

* update model checkpoint

Signed-off-by: ericharper <[email protected]>

* update megatron gpt model

Signed-off-by: ericharper <[email protected]>

* correct var

Signed-off-by: ericharper <[email protected]>

* check for mcore gpt and use gpt model list

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove model prefix

Signed-off-by: ericharper <[email protected]>

* setup te tp groups

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert

Signed-off-by: eharper <[email protected]>

* revert

Signed-off-by: eharper <[email protected]>

* add default

Signed-off-by: eharper <[email protected]>

* add default

Signed-off-by: eharper <[email protected]>

* revert

Signed-off-by: eharper <[email protected]>

* update sharded state dict for interleaved

Signed-off-by: eharper <[email protected]>

* update load for interleaved

Signed-off-by: eharper <[email protected]>

* check sharded state dict is nonempty

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert comment

Signed-off-by: eharper <[email protected]>

* inject before checking legacy ckpt

Signed-off-by: eharper <[email protected]>

* revert

Signed-off-by: eharper <[email protected]>

* pop arg for now

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert jenkins change

Signed-off-by: eharper <[email protected]>

* remove device state_dict

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce batch size for max steps

Signed-off-by: eharper <[email protected]>

* update megatron core commit

Signed-off-by: eharper <[email protected]>

* Integrate dist ckpt with new DistOpt state dict v2 (NVIDIA#7281)

* Integrate  new DistOpt state dict

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change optimizer fp32_param key

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* update apex commit

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jason Wang <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Labels: CI, core (Changes to NeMo Core), NLP
Projects: none yet
Linked issues: none yet
4 participants