Pull in changes from `argonne-lcf/Megatron-DeepSpeed` #14

saforem2 · 2025-09-17T19:07:19Z

Copilot Summary

This pull request introduces a new workflow for converting large HuggingFace models to DeepSpeed ZeRO checkpoints, updates documentation for both conversion and finetuning processes, and adds supporting scripts and configuration files for Llama3 models. The changes aim to streamline model conversion and finetuning on ALCF systems, improve reproducibility, and provide clear instructions and configs for users.

Model Conversion and Finetuning Workflow

Added `hf_to_zero.py` script to automate conversion of HuggingFace models to DeepSpeed ZeRO checkpoints, including argument parsing, device selection, and configuration generation.
Introduced a new README (`checkpoint_conversion/README.md`) detailing the conversion process, memory estimation, and usage instructions for large models.
Updated finetuning workflow for Llama3 models, including a new README (`finetune_llama3/README.md`), a comprehensive shell script (`finetune_llama.sh`), and DeepSpeed configuration files (`ds_config.json`, `ds_config_empty.json`).
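
The memory estimation and configuration generation mentioned above can be illustrated with a rough sketch. This is not the code in `hf_to_zero.py`; the function names `zero3_model_state_gib` and `make_zero_config` are hypothetical, and the 16 bytes per parameter figure assumes the usual mixed-precision Adam accounting (2 for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 optimizer state). Activations and temporary buffers are not counted, so treat the result as a lower bound.

```python
import json

# Assumed bytes per parameter for mixed-precision Adam training:
# 16-bit weights + 16-bit gradients + fp32 master weights, momentum, variance.
BYTES_WEIGHTS = 2
BYTES_GRADS = 2
BYTES_OPTIMIZER = 12


def zero3_model_state_gib(num_params: int, num_devices: int) -> float:
    """Estimate per-device model-state memory in GiB under ZeRO stage 3.

    Stage 3 shards weights, gradients, and optimizer state across devices,
    so the total is divided evenly by num_devices.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be >= 1")
    total_bytes = num_params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER)
    return total_bytes / num_devices / 2**30


def make_zero_config(stage: int = 3, micro_batch_size: int = 1) -> dict:
    """Build a minimal DeepSpeed config dict (hypothetical helper).

    Only widely used top-level keys are set here; a real run would
    likely need more settings than this.
    """
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": stage},
    }


if __name__ == "__main__":
    # Illustrative numbers only: 8e9 parameters sharded across 12 devices.
    estimate = zero3_model_state_gib(8_000_000_000, 12)
    print(f"{estimate:.2f} GiB of model state per device")
    print(json.dumps(make_zero_config(), indent=2))
```

An estimate like this is mainly useful for deciding how many devices a conversion or finetuning job needs before launching it; the README in `checkpoint_conversion/` remains the reference for the actual procedure.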

Dataset and Notes Updates

Changed dataset paths in `sunspot/books.txt` to reference new locations, improving data management for training.
Added notes documenting AuroraGPT-3B organization, configs, and experimental plans for future reference and reproducibility.

saforem2 and others added 30 commits December 29, 2024 11:28

docs: Add ALCF/notes/universal_checkpoint_bug.md

f1b9e8d

Update universal_checkpoint_bug.md

246e82b

Merge pull request #73 from argonne-lcf/docs-ucp-bug

439c777

feat: Add ALCF/examples/finetune_llama3/*

03da571

docs: Update ALCF/examples/finetune_llama3/*

19bdff0

chore: Update tools/hf2megads_weight_converter.py

e2cb209

feat: Add ALCF/examples/finetune_llama3p2_1B/*

a868788

feat: Update ALCF/examples/finetune_llama3/*

babef03

Update README.md

7727f93

Remove redundant ALCF/examples/finetune_llama3p2_1B/*

13666a1

chore: Add DummyOptimizer to tools/hf2megads_weight_converter.py

6b5eed5

fix: NO_FLASH_ATTN on Polaris in ALCF/helpers.sh

2f9e19d

Add {sunspot, sophia} in ALCF/examples/finetune_llama3/*

0636aea

Merge branch 'main' into finetune-llama3

b800277

fix: Call set_ccl_vars_on_aurora only if WORLD_SIZE > 1

adeca53

Update README.md

4d0077c

Merge pull request #76 from argonne-lcf/saforem2-patch-2

3af7eb4

Merge pull request #75 from argonne-lcf/fix-single-node

8098a70

added adopt optimizer

e7990d5

adopt optimizer

0948f84

fix: Resolve merge commit

1c04f64

chore: Update Llama FT

c1f99b9

chore: Update megatron/data/prompt_dataset.py

3991f25

feat: Add ALCF/examples/checkpoint_conversion/*

101d0ed

docs: Update ALCF/examples/checkpoint_conversion/README.md

10903ef

Update README.md

19b3b74

docs: Add ALCF/notes/deprecated.md

0dcc101

docs: Update ALCF/README.md

a3424b6

docs: Update ALCF/notes/deprecated.md

bf50938

Merge branch 'main' into finetune-llama3

b9258f5

saforem2 and others added 29 commits July 15, 2025 12:42

chore: Update megatron/timers.py

ebb1899

fixed infinite schedulers bugs and dshampoo name in arguments

e3b0398

chore: Update ALCF/README.md

1508ff5

feat: Create train.sh

1224815

chore: Update megatron/training_log_alcf.py

6925291

chore: Update ALCF/helpers.sh

dacd3d2

cache indices support

b691201

feat: Add ALCF/data-lists/aurora/olmo-mix-1124.txt

d64abca

chore: Update train_alcf.sh

7012ebc

Merge branch 'saforem2/fix-formatting' of https://github.com/argonne-…

b11e4f0

…lcf/Megatron-DeepSpeed into saforem2/fix-formatting

chore: Update ALCF/helpers.sh

90aeb82

chore: Update ALCF/helpers.sh

f41b3ab

chore: Update megatron/training_log_alcf.py

df0c30a

docs: Add ALCF/notes/AuroraGPT-small.md

eb10947

docs: Update ALCF/notes/AuroraGPT-small.md

50050fd

feat: Update ALCF/data-lists/sunspot/books.txt

f12d970

chore: Update ALCF/helpers.sh

7ab7e35

Added muonclip and fixed lr_finder logic

456abc6

Merge branch 'saforem2/fix-formatting' into feature/cache_indices

1a3653d

Updated muonclip lr adjuster

e9467fa

chore: Update ALCF/helpers.sh

99b2592

feat: Add train_aGPT_2B_large_batch.sh

4eab242

docs: Update ALCF/notes/*.md

3848f6f

chore: Add train_aGPT_7B_chain.sh

96c5a10

added cooldown phase option to constant LR decay

fc4d167

feat: Add train_aGPT_2B_large_batch.sh

9b46590

Merge branch 'saforem2/fix-formatting' into feature/cache_indices

3d83690

Merge pull request #93 from argonne-lcf/feature/cache_indices

ec58e99

cache indices support

Merge pull request #88 from argonne-lcf/saforem2/fix-formatting

994f2a1

chore: Format `megatron/*`

saforem2 merged commit d2c6684 into saforem2:main Sep 17, 2025
