Merged

138 commits
f1b9e8d
docs: Add `ALCF/notes/universal_checkpoint_bug.md`
saforem2 Dec 29, 2024
246e82b
Update universal_checkpoint_bug.md
saforem2 Dec 29, 2024
439c777
Merge pull request #73 from argonne-lcf/docs-ucp-bug
saforem2 Dec 29, 2024
03da571
feat: Add `ALCF/examples/finetune_llama3/*`
saforem2 Jan 14, 2025
19bdff0
docs: Update `ALCF/examples/finetune_llama3/*`
saforem2 Jan 15, 2025
e2cb209
chore: Update `tools/hf2megads_weight_converter.py`
saforem2 Jan 15, 2025
a868788
feat: Add `ALCF/examples/finetune_llama3p2_1B/*`
saforem2 Jan 15, 2025
babef03
feat: Update `ALCF/examples/finetune_llama3/*`
saforem2 Jan 15, 2025
7727f93
Update README.md
saforem2 Jan 15, 2025
13666a1
Remove redundant `ALCF/examples/finetune_llama3p2_1B/*`
saforem2 Jan 15, 2025
6b5eed5
chore: Add `DummyOptimizer` to `tools/hf2megads_weight_converter.py`
saforem2 Jan 16, 2025
2f9e19d
fix: `NO_FLASH_ATTN` on Polaris in `ALCF/helpers.sh`
saforem2 Jan 16, 2025
0636aea
Add {sunspot, sophia} in `ALCF/examples/finetune_llama3/*`
saforem2 Jan 17, 2025
b800277
Merge branch 'main' into finetune-llama3
saforem2 Jan 18, 2025
adeca53
fix: Call `set_ccl_vars_on_aurora` only if `WORLD_SIZE > 1`
saforem2 Jan 19, 2025
4d0077c
Update README.md
saforem2 Jan 27, 2025
3af7eb4
Merge pull request #76 from argonne-lcf/saforem2-patch-2
saforem2 Jan 27, 2025
8098a70
Merge pull request #75 from argonne-lcf/fix-single-node
saforem2 Jan 28, 2025
e7990d5
added adopt optimizer
Jan 28, 2025
0948f84
adopt optimizer
Jan 28, 2025
1c04f64
fix: Resolve merge commit
saforem2 Jan 28, 2025
c1f99b9
chore: Update Llama FT
saforem2 Jan 28, 2025
3991f25
chore: Update `megatron/data/prompt_dataset.py`
saforem2 Jan 28, 2025
101d0ed
feat: Add `ALCF/examples/checkpoint_conversion/*`
saforem2 Jan 31, 2025
10903ef
docs: Update `ALCF/examples/checkpoint_conversion/README.md`
saforem2 Jan 31, 2025
19b3b74
Update README.md
saforem2 Mar 12, 2025
0dcc101
docs: Add `ALCF/notes/deprecated.md`
saforem2 Mar 12, 2025
a3424b6
docs: Update `ALCF/README.md`
saforem2 Mar 12, 2025
bf50938
docs: Update `ALCF/notes/deprecated.md`
saforem2 Mar 12, 2025
b9258f5
Merge branch 'main' into finetune-llama3
saforem2 Mar 12, 2025
cf49054
Merge pull request #74 from argonne-lcf/finetune-llama3
saforem2 Mar 12, 2025
6b8f092
Merge branch 'main' into saforem2-patch-2
saforem2 Mar 12, 2025
aedc2c2
Merge pull request #78 from argonne-lcf/saforem2-patch-2
saforem2 Mar 12, 2025
6fcf7d5
Update README.md
saforem2 Mar 12, 2025
39784a0
Merge pull request #79 from argonne-lcf/saforem2-patch-3
saforem2 Mar 12, 2025
a5fea86
added muon
Mar 20, 2025
f145195
added muon optimizer
Mar 24, 2025
8bfee6f
added muon optimizer
Mar 24, 2025
444af68
chore: formatting in `megatron/model/__init__.py`
saforem2 Mar 26, 2025
9c472b4
chore: formatting `megatron/utils.py`
saforem2 Mar 26, 2025
11f1433
Merge branch 'main' into lb-optimizers
saforem2 Mar 26, 2025
267d650
added infinite schedulers
Apr 21, 2025
6b6c63d
fix: Fix imports in `pretrain_gpt_alcf.py`
saforem2 Apr 22, 2025
7ee03d7
chore: Update `ALCF/helpers.sh`
saforem2 Apr 23, 2025
3c3eb45
feat: Add `train_alcf.sh`
saforem2 Apr 23, 2025
012800e
Merge branch 'update-ALCF-helpers' into fix/pretrain-gpt-alcf-imports
saforem2 Apr 23, 2025
68b53d9
Merge pull request #84 from argonne-lcf/fix/pretrain-gpt-alcf-imports
saforem2 Apr 23, 2025
f40b7e5
fix: Update `train_alcf.sh`
saforem2 Apr 23, 2025
61f13cf
Merge pull request #82 from argonne-lcf/fix/pretrain-gpt-alcf-imports
saforem2 Apr 30, 2025
85ac175
Merge branch 'main' into update-ALCF-helpers
saforem2 Apr 30, 2025
952940a
fix: Fix `ALCF/helpers.sh`
saforem2 Apr 30, 2025
b136aa5
fix: Replace `eval` with `bash -c` in `train_alcf.sh`
saforem2 Apr 30, 2025
9a3f6bd
chore: Fix unset `CFLAGS` in `ALCF/helpers.sh`
saforem2 May 1, 2025
61ce1c5
chore: Update `train_alcf.sh`
saforem2 May 5, 2025
ba01f41
added lr finder logic
May 6, 2025
d7e12df
fix: Remove call to `set_ccl_vars_on_aurora` in `ALCF/helpers.sh`
saforem2 May 6, 2025
e8efc70
chore: Clean up `train_alcf.sh`
saforem2 May 6, 2025
4669e65
Merge pull request #83 from argonne-lcf/update-ALCF-helpers
saforem2 May 6, 2025
fa73d59
feat: Resolve conflicts in `train_alcf.sh`
saforem2 May 19, 2025
ac0df1d
chore: Update `train_alcf.sh`
saforem2 Jun 16, 2025
48f300f
feat: Add `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 16, 2025
5cd5da9
chore: Update `pretrain_gpt_alcf.py`
saforem2 Jun 16, 2025
be64b02
chore: Update `megatron/training.py`
saforem2 Jun 16, 2025
08ab7ee
chore: Update `train_alcf.sh`
saforem2 Jun 16, 2025
8c13fef
chore: Remove tensorboard tracking in `megatron/training_log.py`
saforem2 Jun 16, 2025
3e174cd
chore: Update `ALCF/helpers.sh`
saforem2 Jun 16, 2025
cb939d7
Merge pull request #86 from argonne-lcf/saforem2/dev
saforem2 Jun 16, 2025
ea99a52
Merge branch 'main' into saforem2/training
saforem2 Jun 16, 2025
0ddd003
feat: Remove `--use-mics` flag when using `ZeRO` 3
saforem2 Jun 17, 2025
43794c1
docs: Update `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 17, 2025
351cfeb
docs: Update `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 17, 2025
e7157e4
docs: Update `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 17, 2025
c9e5879
chore: Update `pretrain_gpt_alcf.py`
saforem2 Jun 18, 2025
b85d33b
chore: Update `megatron/training_log.py`
saforem2 Jun 18, 2025
82e0b2e
Merge pull request #87 from argonne-lcf/saforem2/training
saforem2 Jun 18, 2025
a9620b9
chore: Format `megatron/data/*`
saforem2 Jun 18, 2025
7af703b
chore: Format `megatron/text_generation/*`
saforem2 Jun 18, 2025
103467d
chore: Format `megatron/optimizer/*`
saforem2 Jun 18, 2025
e37ab7d
chore: Format `megatron/mpu/*`
saforem2 Jun 18, 2025
b22a2e1
chore: Format `megatron/tokenizer/*`
saforem2 Jun 18, 2025
ed02e5f
chore: Format `megatron/model/*`
saforem2 Jun 18, 2025
40e942a
chore: Format `megatron/core/*`
saforem2 Jun 18, 2025
f4cd2e1
chore: Format `megatron/*.py`
saforem2 Jun 18, 2025
3a6ffec
chore: Format `megatron/fused_kernels/*.py`
saforem2 Jun 18, 2025
c030485
infinite schedulers and learning rate finder
Jul 3, 2025
fdb0965
micromamba
Jul 11, 2025
a6b7bd2
emb_init branch changes added
Jul 11, 2025
8800bcc
cleaned up
Jul 11, 2025
5464ebf
Merge branch 'lb-optimizers' into saforem2/fix-formatting
saforem2 Jul 12, 2025
ffaafd7
fix: missing changes from `lb-optimizers` <- `saforem2/fix-formatting`
saforem2 Jul 12, 2025
c18e3ad
fix: Fix missing comma in `core/models/gpt/gpt_embedding.py`
saforem2 Jul 12, 2025
e9ca6d0
chore: Update `pretrain_gpt_alcf.py`
saforem2 Jul 12, 2025
a1d5588
chore: Update `ALCF/helpers.sh`
saforem2 Jul 12, 2025
3f668a1
feat: Add `ALCF/notes/debugging.md`
saforem2 Jul 14, 2025
3dc22e9
Update debugging.md
saforem2 Jul 14, 2025
386863b
chore: Update `train_alcf.sh`
saforem2 Jul 15, 2025
82b491f
chore: Update `megatron/training_log.py`
saforem2 Jul 15, 2025
9ffebe1
chore: Update `megatron/training.py`
saforem2 Jul 15, 2025
b4849ae
chore: Update `megatron/optimizer_param_scheduler.py`
saforem2 Jul 15, 2025
683386a
chore: Update `megatron/optimizer/muon.py`
saforem2 Jul 15, 2025
6165561
chore: Update `megatron/optimizer/adopt.py`
saforem2 Jul 15, 2025
558322d
chore: Update `megatron/optimizer/__init__.py`
saforem2 Jul 15, 2025
1656e98
chore: Update `megatron/core/transformer/transformer_config.py`
saforem2 Jul 15, 2025
102d797
chore: Update `megatron/checkpointing.py`
saforem2 Jul 15, 2025
f7f397b
chore: Update `megatron/arguments.py`
saforem2 Jul 15, 2025
c369de6
chore: Update `ALCF/helpers.sh`
saforem2 Jul 15, 2025
99dacb5
chore: Update `ALCF/helpers.sh`
saforem2 Jul 15, 2025
6b08a2d
docs: Update `ALCF/notes/debugging.md`
saforem2 Jul 15, 2025
984cbb1
chore: Update `megatron/training_log.py`
saforem2 Jul 15, 2025
ebb1899
chore: Update `megatron/timers.py`
saforem2 Jul 15, 2025
e3b0398
fixed infinite schedulers bugs and dshampoo name in arguments
Jul 22, 2025
1508ff5
chore: Update `ALCF/README.md`
saforem2 Aug 7, 2025
1224815
feat: Create `train.sh`
saforem2 Aug 7, 2025
6925291
chore: Update `megatron/training_log_alcf.py`
saforem2 Aug 7, 2025
dacd3d2
chore: Update `ALCF/helpers.sh`
saforem2 Aug 15, 2025
b691201
cache indices support
zhenghh04 Aug 16, 2025
d64abca
feat: Add `ALCF/data-lists/aurora/olmo-mix-1124.txt`
saforem2 Aug 21, 2025
7012ebc
chore: Update `train_alcf.sh`
Aug 21, 2025
b11e4f0
Merge branch 'saforem2/fix-formatting' of https://github.com/argonne-…
Aug 21, 2025
90aeb82
chore: Update `ALCF/helpers.sh`
saforem2 Aug 21, 2025
f41b3ab
chore: Update `ALCF/helpers.sh`
saforem2 Aug 21, 2025
df0c30a
chore: Update `megatron/training_log_alcf.py`
saforem2 Aug 21, 2025
eb10947
docs: Add `ALCF/notes/AuroraGPT-small.md`
saforem2 Aug 21, 2025
50050fd
docs: Update `ALCF/notes/AuroraGPT-small.md`
saforem2 Aug 21, 2025
f12d970
feat: Update `ALCF/data-lists/sunspot/books.txt`
Aug 21, 2025
7ab7e35
chore: Update `ALCF/helpers.sh`
saforem2 Aug 22, 2025
456abc6
Added muonclip and fixed lr_finder logic
Aug 24, 2025
1a3653d
Merge branch 'saforem2/fix-formatting' into feature/cache_indices
saforem2 Aug 25, 2025
e9467fa
Updated muonclip lr adjuster
Aug 25, 2025
99b2592
chore: Update `ALCF/helpers.sh`
saforem2 Aug 26, 2025
4eab242
feat: Add `train_aGPT_2B_large_batch.sh`
saforem2 Aug 26, 2025
3848f6f
docs: Update `ALCF/notes/*.md`
saforem2 Aug 26, 2025
96c5a10
chore: Add `train_aGPT_7B_chain.sh`
saforem2 Aug 26, 2025
fc4d167
added cooldown phase option to constant LR decay
Aug 27, 2025
9b46590
feat: Add `train_aGPT_2B_large_batch.sh`
saforem2 Aug 27, 2025
3d83690
Merge branch 'saforem2/fix-formatting' into feature/cache_indices
saforem2 Aug 27, 2025
ec58e99
Merge pull request #93 from argonne-lcf/feature/cache_indices
saforem2 Sep 3, 2025
994f2a1
Merge pull request #88 from argonne-lcf/saforem2/fix-formatting
saforem2 Sep 8, 2025
1,200 changes: 183 additions & 1,017 deletions ALCF/README.md

Large diffs are not rendered by default.

1,438 changes: 1,438 additions & 0 deletions ALCF/data-lists/aurora/olmo-mix-1124.txt

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions ALCF/data-lists/sunspot/books.txt
@@ -1,3 +1,5 @@
-0.0031025147279277244 /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0000_text_document books
-0.003102019887362634 /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0001_text_document books
-0.0009996745994661548 /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document books
+0.0031025147279277244 /tegu/datascience/foremans/books-dataset/books-0000_text_document books
+0.003102019887362634 /tegu/datascience/foremans/books-dataset/books-0001_text_document books
+0.0009996745994661548 /tegu/datascience/foremans/books-dataset/books-0002_text_document books


118 changes: 118 additions & 0 deletions ALCF/examples/checkpoint_conversion/README.md
@@ -0,0 +1,118 @@
# Converting `AutoModel` to DeepSpeed ZeRO Checkpoint

We would like to convert an (arbitrarily large) HuggingFace model to a ZeRO
checkpoint so that we can use it for continual pre-training with
Megatron-DeepSpeed.

Previously, we had been using the approach from [ALCF / examples /
finetune_llama3](/ALCF/examples/finetune_llama3/README.md).

In particular, this approach works as follows:

1. Instantiate the Megatron-DeepSpeed (MDS) model as normal (with empty
   weights) ([here](/tools/hf2megads_weight_converter.py#L712)):

```python
from megatron.model import GPTModelPipe
ds_model = GPTModelPipe(config, num_tokentypes=0, parallel_output=True)
```

2. Instantiate the HF model ([here](/tools/hf2megads_weight_converter.py#L725)):

```python
from transformers import AutoModel
hf_model = AutoModel.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
```

3. Instantiate the optimizer ([here](/tools/hf2megads_weight_converter.py#L736))

4. Copy the weights, layer by layer, from the HF model to the MDS model
   ([here](/tools/hf2megads_weight_converter.py#L766))


Unfortunately, for very large models, this slowly consumes available host
memory until it is exhausted, causing the application to crash.

## Proposed Solution

Our proposed solution is simple and entirely contained in [ALCF / examples / checkpoint_conversion / hf_to_zero.py](/ALCF/examples/checkpoint_conversion/hf_to_zero.py).

Explicitly:

1. Create the HF model as normal
2. Pass it to `deepspeed.initialize(...)` to create the `DeepSpeedEngine`
3. Call `DeepSpeedEngine.save_checkpoint(...)` to save the checkpoint (sketched below)
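
A minimal sketch of these three steps (the model name and ZeRO settings here are illustrative; the full `hf_to_zero.py` script below adds CLI flags, `zero.Init()` for stage 3, and bucket-size tuning):

```python
import deepspeed
from transformers import AutoModelForCausalLM

# 1. Create the HF model as normal
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')

# 2. Wrap it in a DeepSpeedEngine with a minimal ZeRO-3 config
ds_config = {
    'train_batch_size': 1,
    'bf16': {'enabled': True},
    'optimizer': {'type': 'Adam'},
    'zero_optimization': {'stage': 3},
}
engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)

# 3. Save the sharded ZeRO checkpoint
engine.save_checkpoint('zero-checkpoints/Llama-3.2-1B-Instruct-zs3')
```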


To run:

```bash
launch python3 \
    ALCF/examples/checkpoint_conversion/hf_to_zero.py \
    --zero-stage=3 \
    --device=cpu \
    --model='meta-llama/Llama-3.3-70B-Instruct'
```

> [!WARNING]
> I believe this approach is still not finished because I expect there will be
> naming mismatches between the layers of the HF model (now saved in our ZeRO
> checkpoint) and what our MDS model expects.
>
> This requires further testing to confirm, but we are now able to successfully
> convert the 70B model to a ZeRO checkpoint.
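
One way to probe the suspected mismatch is to diff the parameter names of the two `state_dict`s (a sketch; `hf_model` and `mds_model` are hypothetical handles, constructed as in `tools/hf2megads_weight_converter.py`):

```python
# Compare parameter names between the HF and MDS models
hf_keys = set(hf_model.state_dict().keys())
mds_keys = set(mds_model.state_dict().keys())
print('only in HF: ', sorted(hf_keys - mds_keys)[:10])
print('only in MDS:', sorted(mds_keys - hf_keys)[:10])
```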

## Estimate Memory Needs for Llama-3.3-70B-Instruct

DeepSpeed provides a convenient mechanism for estimating the memory needs of a model.

Below is the summary for the Llama-3.3-70B-Instruct model of interest.

| Model Name | Model Size | Model Parameters | Largest Layer Parameters | Memory Needed |
|:----------------------:|:----------:|:----------------:|:------------------------:|:-------------:|
| Llama-3.3-70B-Instruct | 70B | 69503M | 1050M | 70.45GB |



```bash
$ python3 -c 'from transformers import AutoModel; \
    from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
    model = AutoModel.from_pretrained("meta-llama/Llama-3.3-70B-Instruct"); \
    estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=12, num_nodes=4)'
```

<details closed><summary>Output</summary>


```bash
Loading checkpoint shards: 100%|████████████████| 30/30 [08:28<00:00, 16.94s/it]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 4 nodes, 12 GPUs per node.
SW: Model with 69503M total params, 1050M largest layer params.
per CPU | per GPU | Options
436.93GB | 3.91GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
4660.54GB | 3.91GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
388.38GB | 6.61GB | offload_param=none, offload_optimizer=cpu , zero_init=1
4660.54GB | 6.61GB | offload_param=none, offload_optimizer=cpu , zero_init=0
70.45GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=1
4660.54GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=0
took: 0h:08m:44s
```

</details>


- Model States and Memory Needs for Llama-3.3-70B-Instruct:


| per CPU | per GPU | Options |
|:---------:|:-------:|:-------------------------------------------------------:|
| 436.93GB | 3.91GB | offload_param=cpu, offload_optimizer=cpu, zero_init=1 |
| 4660.54GB | 3.91GB | offload_param=cpu, offload_optimizer=cpu, zero_init=0 |
| 388.38GB | 6.61GB | offload_param=none, offload_optimizer=cpu, zero_init=1 |
| 4660.54GB | 6.61GB | offload_param=none, offload_optimizer=cpu, zero_init=0 |
| 70.45GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=1 |
| 4660.54GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=0 |
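
The `zero_init=1` rows assume the model is constructed under `deepspeed.zero.Init()`, which allocates parameters as shards across ranks from the start instead of replicating the full model in every CPU process; this is what `hf_to_zero.py` does for stage 3. A sketch, mirroring the script:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Parameters are partitioned across ranks at construction time (zero_init=1),
# avoiding a full per-process copy of the 70B model in host memory
with deepspeed.zero.Init():
    model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.3-70B-Instruct')
```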



174 changes: 174 additions & 0 deletions ALCF/examples/checkpoint_conversion/hf_to_zero.py
@@ -0,0 +1,174 @@
from argparse import Namespace
import os
from pathlib import Path
from typing import Optional

import ezpz
import torch
import torch.distributed
import deepspeed

from transformers import AutoModelForCausalLM

logger = ezpz.get_logger(__name__)


def parse_args():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model', type=str, default='meta-llama/Llama-3.2-1B-Instruct'
    )
    parser.add_argument('--device', type=str, default=None, required=False)
    parser.add_argument('--train-batch-size', type=int, default=1)
    parser.add_argument('--zero-stage', type=int, default=3)
    # add arg for output directory
    parser.add_argument('--output-dir', type=str, default='.')
    parser.add_argument('--kv-offload', action='store_true')
    parser.add_argument('--async-kv-offload', action='store_true')
    parser.add_argument('--gen-len', type=int, default=1024)
    parser.add_argument('--strict', action='store_true')
    return parser.parse_args()


def meta_to_cpu(container, dtype=None):
    if isinstance(container, torch.Tensor):
        return torch.empty(*container.shape, dtype=dtype or container.dtype)
    elif isinstance(container, tuple):
        return tuple(meta_to_cpu(x, dtype) for x in container)
    elif isinstance(container, dict):
        return dict((k, meta_to_cpu(v, dtype)) for k, v in container.items())
    else:
        raise ValueError(f'Invalid type: {container}')


def get_model(
    model_name: str = 'meta-llama/Llama-3.2-1B-Instruct',
    dummy: Optional[bool] = None,
    ignore_mismatched_sizes: bool = True,
) -> torch.nn.Module:
    if dummy:
        filename = Path('.').joinpath(
            f'{model_name.replace("/", "-")}-hf-weights'
        )
        if not filename.exists():
            from accelerate import init_empty_weights

            logger.info('Creating dummy weights')
            with init_empty_weights():
                model = AutoModelForCausalLM.from_pretrained(
                    f'{model_name}',
                    ignore_mismatched_sizes=ignore_mismatched_sizes,
                )
            model.save_pretrained(
                filename,
                state_dict=meta_to_cpu(model.state_dict(), torch.float16),
            )
            return model

    model = AutoModelForCausalLM.from_pretrained(
        f'{model_name}',
        ignore_mismatched_sizes=ignore_mismatched_sizes,
    )
    return model


def get_ds_config(
    micro_batch_size: int = 1,
    gradient_accumulation_steps: int = 2,
    zero_stage: int = 3,
    hidden_size: Optional[int] = None,
) -> dict:
    train_batch_size = (
        micro_batch_size * ezpz.get_world_size() * gradient_accumulation_steps
    )
    zero_config = {
        'stage': zero_stage,
    }
    if zero_stage == 3:
        if hidden_size is not None:
            zero_config |= {
                'stage3_prefetch_bucket_size': 2 * hidden_size * hidden_size,
                'stage3_param_persistence_threshold': hidden_size,
                'stage3_max_live_parameters': 2 * hidden_size * hidden_size,
            }
        zero_config |= {
            'offload_optimizer': {
                'device': 'cpu',
            },
            'offload_param': {
                'device': 'cpu',
            },
        }

    return {
        'bf16': {'enabled': True},
        'fp16': {'enabled': False},
        'gradient_accumulation_steps': gradient_accumulation_steps,
        'optimizer': {
            'type': 'Adam',
        },
        'steps_per_print': 1,
        'train_batch_size': train_batch_size,
        'train_micro_batch_size_per_gpu': 1,
        'wall_clock_breakdown': True,
        'zero_optimization': zero_config,
    }


def convert_checkpoint(args: Namespace):
    if args.device is not None and args.device == 'cpu':
        os.environ['TORCH_DEVICE'] = 'cpu'
        os.environ['DS_ACCELERATOR'] = 'cpu'

    if args.zero_stage == 3:
        # Allocate parameters as ZeRO-3 shards at construction time, so the
        # full model is never materialized on a single host
        cm = deepspeed.zero.Init()
    else:
        from contextlib import nullcontext

        cm = nullcontext()

    with cm:
        with torch.no_grad():
            model = get_model(
                args.model, ignore_mismatched_sizes=not args.strict
            )

    assert isinstance(model, torch.nn.Module)
    if args.kv_offload:
        model.set_kv_cache_offload(
            True,
            gen_len=args.gen_len,
            async_kv_offload=args.async_kv_offload,
        )

    logger.info(f'model:\n{model}')
    logger.info(f'{model.config=}')
    ds_config = get_ds_config(
        micro_batch_size=args.train_batch_size,
        zero_stage=args.zero_stage,
        hidden_size=model.config.hidden_size,
    )
    output_dir = Path('zero-checkpoints').joinpath(
        f'{args.model}-zs{args.zero_stage}-mb{args.train_batch_size}'
    )

    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
    ds_engine.module.eval()
    model = ds_engine.module
    logger.info(f'Saving ZeRO checkpoint to {output_dir}')

    ds_engine.save_checkpoint(output_dir)

    torch.distributed.barrier()


def main():
    _ = ezpz.setup_torch(backend='DDP')
    args = parse_args()
    convert_checkpoint(args)


if __name__ == '__main__':
    main()
73 changes: 73 additions & 0 deletions ALCF/examples/finetune_llama3/README.md
@@ -0,0 +1,73 @@
# Finetune Llama3 from Hugging Face Checkpoint

1. **Clone + navigate into repo**:

```bash
git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
```

1. **Setup environment**:

```bash
PBS_O_WORKDIR=$(pwd) source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
ezpz_setup_env
```

1. **Install Dependencies**:

```bash
python3 -m pip install deepspeed --require-virtualenv
python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
python3 -m pip install -e .
```

1. **Download data**:

```bash
curl https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json -o dataset/alpaca_data.json
```

(from [here](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json))

1. **Download HF Checkpoint**:

```bash
MODEL="Llama-3.2-1B"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download "meta-llama/${MODEL}" --local-dir "${MODEL}"
```

- _might_ require updating `huggingface_hub` and `hf_transfer`:

```bash
python3 -m pip install --upgrade "huggingface_hub[hf_transfer,cli]" hf_transfer
```

1. **Convert HF --> MDS**:

```bash
TP=1 PP=1 ZERO_STAGE=1 MODEL_NAME=Llama-3.2-1B bash ALCF/examples/finetune_llama3/finetune_llama.sh convert_hf2mds
```

<details closed><summary>Old:</summary>

From original README:

### Usage

#### 1. Converting Hugging Face Model Weights to Megatron-Deepspeed Model

```bash
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds
```

This command writes the Hugging Face model weights into the Megatron-DeepSpeed model and saves the result. You can adjust the parallel configuration in the script. `convert_mds2hf` converts a Megatron-DeepSpeed model back into the Hugging Face format.

#### 2. Fine-tuning Process

```bash
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh
```

Execute this command to initiate the finetuning process. The task originates from [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca.git).

</details>
18 changes: 18 additions & 0 deletions ALCF/examples/finetune_llama3/ds_config.json
@@ -0,0 +1,18 @@
{
  "train_batch_size": 6,
  "train_micro_batch_size_per_gpu": 1,
  "steps_per_print": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4
    }
  },
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {
    "enabled": true
  }
}
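
Note that DeepSpeed requires `train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * (data-parallel world size)`, so this config assumes 6 ranks. A quick sanity check (a sketch; the path assumes you run from the repo root):

```python
import json

# Verify the batch-size identity that DeepSpeed enforces at initialization
with open('ALCF/examples/finetune_llama3/ds_config.json') as f:
    cfg = json.load(f)

world_size = 6  # assumed number of data-parallel ranks for this config
assert cfg['train_batch_size'] == (
    cfg['train_micro_batch_size_per_gpu']
    * cfg['gradient_accumulation_steps']
    * world_size
)
```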