Merged

138 commits
f1b9e8d
docs: Add `ALCF/notes/universal_checkpoint_bug.md`
saforem2 Dec 29, 2024
246e82b
Update universal_checkpoint_bug.md
saforem2 Dec 29, 2024
439c777
Merge pull request #73 from argonne-lcf/docs-ucp-bug
saforem2 Dec 29, 2024
03da571
feat: Add `ALCF/examples/finetune_llama3/*`
saforem2 Jan 14, 2025
19bdff0
docs: Update `ALCF/examples/finetune_llama3/*`
saforem2 Jan 15, 2025
e2cb209
chore: Update `tools/hf2megads_weight_converter.py`
saforem2 Jan 15, 2025
a868788
feat: Add `ALCF/examples/finetune_llama3p2_1B/*`
saforem2 Jan 15, 2025
babef03
feat: Update `ALCF/examples/finetune_llama3/*`
saforem2 Jan 15, 2025
7727f93
Update README.md
saforem2 Jan 15, 2025
13666a1
Remove redundant `ALCF/examples/finetune_llama3p2_1B/*`
saforem2 Jan 15, 2025
6b5eed5
chore: Add `DummyOptimizer` to `tools/hf2megads_weight_converter.py`
saforem2 Jan 16, 2025
2f9e19d
fix: `NO_FLASH_ATTN` on Polaris in `ALCF/helpers.sh`
saforem2 Jan 16, 2025
0636aea
Add {sunspot, sophia} in `ALCF/examples/finetune_llama3/*`
saforem2 Jan 17, 2025
b800277
Merge branch 'main' into finetune-llama3
saforem2 Jan 18, 2025
adeca53
fix: Call `set_ccl_vars_on_aurora` only if `WORLD_SIZE > 1`
saforem2 Jan 19, 2025
4d0077c
Update README.md
saforem2 Jan 27, 2025
3af7eb4
Merge pull request #76 from argonne-lcf/saforem2-patch-2
saforem2 Jan 27, 2025
8098a70
Merge pull request #75 from argonne-lcf/fix-single-node
saforem2 Jan 28, 2025
e7990d5
added adopt optimizer
Jan 28, 2025
0948f84
adopt optimizer
Jan 28, 2025
1c04f64
fix: Resolve merge commit
saforem2 Jan 28, 2025
c1f99b9
chore: Update Llama FT
saforem2 Jan 28, 2025
3991f25
chore: Update `megatron/data/prompt_dataset.py`
saforem2 Jan 28, 2025
101d0ed
feat: Add `ALCF/examples/checkpoint_conversion/*`
saforem2 Jan 31, 2025
10903ef
docs: Update `ALCF/examples/checkpoint_conversion/README.md`
saforem2 Jan 31, 2025
19b3b74
Update README.md
saforem2 Mar 12, 2025
0dcc101
docs: Add `ALCF/notes/deprecated.md`
saforem2 Mar 12, 2025
a3424b6
docs: Update `ALCF/README.md`
saforem2 Mar 12, 2025
bf50938
docs: Update `ALCF/notes/deprecated.md`
saforem2 Mar 12, 2025
b9258f5
Merge branch 'main' into finetune-llama3
saforem2 Mar 12, 2025
cf49054
Merge pull request #74 from argonne-lcf/finetune-llama3
saforem2 Mar 12, 2025
6b8f092
Merge branch 'main' into saforem2-patch-2
saforem2 Mar 12, 2025
aedc2c2
Merge pull request #78 from argonne-lcf/saforem2-patch-2
saforem2 Mar 12, 2025
6fcf7d5
Update README.md
saforem2 Mar 12, 2025
39784a0
Merge pull request #79 from argonne-lcf/saforem2-patch-3
saforem2 Mar 12, 2025
a5fea86
added muon
Mar 20, 2025
f145195
added muon optimizer
Mar 24, 2025
8bfee6f
added muon optimizer
Mar 24, 2025
444af68
chore: formatting in `megatron/model/__init__.py`
saforem2 Mar 26, 2025
9c472b4
chore: formatting `megatron/utils.py`
saforem2 Mar 26, 2025
11f1433
Merge branch 'main' into lb-optimizers
saforem2 Mar 26, 2025
267d650
added infinite schedulers
Apr 21, 2025
6b6c63d
fix: Fix imports in `pretrain_gpt_alcf.py`
saforem2 Apr 22, 2025
7ee03d7
chore: Update `ALCF/helpers.sh`
saforem2 Apr 23, 2025
3c3eb45
feat: Add `train_alcf.sh`
saforem2 Apr 23, 2025
012800e
Merge branch 'update-ALCF-helpers' into fix/pretrain-gpt-alcf-imports
saforem2 Apr 23, 2025
68b53d9
Merge pull request #84 from argonne-lcf/fix/pretrain-gpt-alcf-imports
saforem2 Apr 23, 2025
f40b7e5
fix: Update `train_alcf.sh`
saforem2 Apr 23, 2025
61f13cf
Merge pull request #82 from argonne-lcf/fix/pretrain-gpt-alcf-imports
saforem2 Apr 30, 2025
85ac175
Merge branch 'main' into update-ALCF-helpers
saforem2 Apr 30, 2025
952940a
fix: Fix `ALCF/helpers.sh`
saforem2 Apr 30, 2025
b136aa5
fix: Replace `eval` with `bash -c` in `train_alcf.sh`
saforem2 Apr 30, 2025
9a3f6bd
chore: Fix unset `CFLAGS` in `ALCF/helpers.sh`
saforem2 May 1, 2025
61ce1c5
chore: Update `train_alcf.sh`
saforem2 May 5, 2025
ba01f41
added lr finder logic
May 6, 2025
d7e12df
fix: Remove call to `set_ccl_vars_on_aurora` in `ALCF/helpers.sh`
saforem2 May 6, 2025
e8efc70
chore: Clean up `train_alcf.sh`
saforem2 May 6, 2025
4669e65
Merge pull request #83 from argonne-lcf/update-ALCF-helpers
saforem2 May 6, 2025
fa73d59
feat: Resolve conflicts in `train_alcf.sh`
saforem2 May 19, 2025
ac0df1d
chore: Update `train_alcf.sh`
saforem2 Jun 16, 2025
48f300f
feat: Add `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 16, 2025
5cd5da9
chore: Update `pretrain_gpt_alcf.py`
saforem2 Jun 16, 2025
be64b02
chore: Update `megatron/training.py`
saforem2 Jun 16, 2025
08ab7ee
chore: Update `train_alcf.sh`
saforem2 Jun 16, 2025
8c13fef
chore: Remove tensorboard tracking in `megatron/training_log.py`
saforem2 Jun 16, 2025
3e174cd
chore: Update `ALCF/helpers.sh`
saforem2 Jun 16, 2025
cb939d7
Merge pull request #86 from argonne-lcf/saforem2/dev
saforem2 Jun 16, 2025
ea99a52
Merge branch 'main' into saforem2/training
saforem2 Jun 16, 2025
0ddd003
feat: Remove `--use-mics` flag when using `ZeRO` 3
saforem2 Jun 17, 2025
43794c1
docs: Update `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 17, 2025
351cfeb
docs: Update `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 17, 2025
e7157e4
docs: Update `ALCF/notes/AuroraGPT-70B.md`
saforem2 Jun 17, 2025
c9e5879
chore: Update `pretrain_gpt_alcf.py`
saforem2 Jun 18, 2025
b85d33b
chore: Update `megatron/training_log.py`
saforem2 Jun 18, 2025
82e0b2e
Merge pull request #87 from argonne-lcf/saforem2/training
saforem2 Jun 18, 2025
a9620b9
chore: Format `megatron/data/*`
saforem2 Jun 18, 2025
7af703b
chore: Format `megatron/text_generation/*`
saforem2 Jun 18, 2025
103467d
chore: Format `megatron/optimizer/*`
saforem2 Jun 18, 2025
e37ab7d
chore: Format `megatron/mpu/*`
saforem2 Jun 18, 2025
b22a2e1
chore: Format `megatron/tokenizer/*`
saforem2 Jun 18, 2025
ed02e5f
chore: Format `megatron/model/*`
saforem2 Jun 18, 2025
40e942a
chore: Format `megatron/core/*`
saforem2 Jun 18, 2025
f4cd2e1
chore: Format `megatron/*.py`
saforem2 Jun 18, 2025
3a6ffec
chore: Format `megatron/fused_kernels/*.py`
saforem2 Jun 18, 2025
c030485
infinite schedulers and learning rate finder
Jul 3, 2025
fdb0965
micromamba
Jul 11, 2025
a6b7bd2
emb_init branch changes added
Jul 11, 2025
8800bcc
cleaned up
Jul 11, 2025
5464ebf
Merge branch 'lb-optimizers' into saforem2/fix-formatting
saforem2 Jul 12, 2025
ffaafd7
fix: missing changes from `lb-optimizers` <- `saforem2/fix-formatting`
saforem2 Jul 12, 2025
c18e3ad
fix: Fix missing comma in `core/models/gpt/gpt_embedding.py`
saforem2 Jul 12, 2025
e9ca6d0
chore: Update `pretrain_gpt_alcf.py`
saforem2 Jul 12, 2025
a1d5588
chore: Update `ALCF/helpers.sh`
saforem2 Jul 12, 2025
3f668a1
feat: Add `ALCF/notes/debugging.md`
saforem2 Jul 14, 2025
3dc22e9
Update debugging.md
saforem2 Jul 14, 2025
386863b
chore: Update `train_alcf.sh`
saforem2 Jul 15, 2025
82b491f
chore: Update `megatron/training_log.py`
saforem2 Jul 15, 2025
9ffebe1
chore: Update `megatron/training.py`
saforem2 Jul 15, 2025
b4849ae
chore: Update `megatron/optimizer_param_scheduler.py`
saforem2 Jul 15, 2025
683386a
chore: Update `megatron/optimizer/muon.py`
saforem2 Jul 15, 2025
6165561
chore: Update `megatron/optimizer/adopt.py`
saforem2 Jul 15, 2025
558322d
chore: Update `megatron/optimizer/__init__.py`
saforem2 Jul 15, 2025
1656e98
chore: Update `megatron/core/transformer/transformer_config.py`
saforem2 Jul 15, 2025
102d797
chore: Update `megatron/checkpointing.py`
saforem2 Jul 15, 2025
f7f397b
chore: Update `megatron/arguments.py`
saforem2 Jul 15, 2025
c369de6
chore: Update `ALCF/helpers.sh`
saforem2 Jul 15, 2025
99dacb5
chore: Update `ALCF/helpers.sh`
saforem2 Jul 15, 2025
6b08a2d
docs: Update `ALCF/notes/debugging.md`
saforem2 Jul 15, 2025
984cbb1
chore: Update `megatron/training_log.py`
saforem2 Jul 15, 2025
ebb1899
chore: Update `megatron/timers.py`
saforem2 Jul 15, 2025
e3b0398
fixed infinite schedulers bugs and dshampoo name in arguments
Jul 22, 2025
1508ff5
chore: Update `ALCF/README.md`
saforem2 Aug 7, 2025
1224815
feat: Create `train.sh`
saforem2 Aug 7, 2025
6925291
chore: Update `megatron/training_log_alcf.py`
saforem2 Aug 7, 2025
dacd3d2
chore: Update `ALCF/helpers.sh`
saforem2 Aug 15, 2025
b691201
cache indices support
zhenghh04 Aug 16, 2025
d64abca
feat: Add `ALCF/data-lists/aurora/olmo-mix-1124.txt`
saforem2 Aug 21, 2025
7012ebc
chore: Update `train_alcf.sh`
Aug 21, 2025
b11e4f0
Merge branch 'saforem2/fix-formatting' of https://github.com/argonne-…
Aug 21, 2025
90aeb82
chore: Update `ALCF/helpers.sh`
saforem2 Aug 21, 2025
f41b3ab
chore: Update `ALCF/helpers.sh`
saforem2 Aug 21, 2025
df0c30a
chore: Update `megatron/training_log_alcf.py`
saforem2 Aug 21, 2025
eb10947
docs: Add `ALCF/notes/AuroraGPT-small.md`
saforem2 Aug 21, 2025
50050fd
docs: Update `ALCF/notes/AuroraGPT-small.md`
saforem2 Aug 21, 2025
f12d970
feat: Update `ALCF/data-lists/sunspot/books.txt`
Aug 21, 2025
7ab7e35
chore: Update `ALCF/helpers.sh`
saforem2 Aug 22, 2025
456abc6
Added muonclip and fixed lr_finder logic
Aug 24, 2025
1a3653d
Merge branch 'saforem2/fix-formatting' into feature/cache_indices
saforem2 Aug 25, 2025
e9467fa
Updated muonclip lr adjuster
Aug 25, 2025
99b2592
chore: Update `ALCF/helpers.sh`
saforem2 Aug 26, 2025
4eab242
feat: Add `train_aGPT_2B_large_batch.sh`
saforem2 Aug 26, 2025
3848f6f
docs: Update `ALCF/notes/*.md`
saforem2 Aug 26, 2025
96c5a10
chore: Add `train_aGPT_7B_chain.sh`
saforem2 Aug 26, 2025
fc4d167
added cooldown phase option to constant LR decay
Aug 27, 2025
9b46590
feat: Add `train_aGPT_2B_large_batch.sh`
saforem2 Aug 27, 2025
3d83690
Merge branch 'saforem2/fix-formatting' into feature/cache_indices
saforem2 Aug 27, 2025
ec58e99
Merge pull request #93 from argonne-lcf/feature/cache_indices
saforem2 Sep 3, 2025
994f2a1
Merge pull request #88 from argonne-lcf/saforem2/fix-formatting
saforem2 Sep 8, 2025
1,200 changes: 183 additions & 1,017 deletions ALCF/README.md

Large diffs are not rendered by default.

1,438 changes: 1,438 additions & 0 deletions ALCF/data-lists/aurora/olmo-mix-1124.txt

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions ALCF/data-lists/sunspot/books.txt
@@ -1,3 +1,5 @@
-0.0031025147279277244 /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0000_text_document books
-0.003102019887362634 /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0001_text_document books
-0.0009996745994661548 /gila/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/books-0002_text_document books
+0.0031025147279277244 /tegu/datascience/foremans/books-dataset/books-0000_text_document books
+0.003102019887362634 /tegu/datascience/foremans/books-dataset/books-0001_text_document books
+0.0009996745994661548 /tegu/datascience/foremans/books-dataset/books-0002_text_document books


118 changes: 118 additions & 0 deletions ALCF/examples/checkpoint_conversion/README.md
@@ -0,0 +1,118 @@
# Converting `AutoModel` to DeepSpeed ZeRO Checkpoint

We would like to convert an (arbitrarily large) HuggingFace model to a ZeRO
checkpoint so that we can use it for continual pre-training with
Megatron-DeepSpeed.

Previously, we had been using the approach from [ALCF / examples /
finetune_llama3](/ALCF/examples/finetune_llama3/README.md).

In particular, this approach works as follows:

1. Instantiate the Megatron-DeepSpeed (MDS) model as normal (with empty
   weights) ([here](/tools/hf2megads_weight_converter.py#L712)):

```python
from megatron.model import GPTModelPipe
ds_model = GPTModelPipe(config, num_tokentypes=0, parallel_output=True)
```

2. Instantiate the HF model ([here](/tools/hf2megads_weight_converter.py#L725)):

```python
from transformers import AutoModel
hf_model = AutoModel.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
```

3. Instantiate the optimizer ([here](/tools/hf2megads_weight_converter.py#L736))

4. Copy the weights, layer by layer, from the HF model to the MDS model
   ([here](/tools/hf2megads_weight_converter.py#L766))


Unfortunately, for very large models, this slowly consumes available host
memory until it is exhausted, causing the application to crash.

## Proposed Solution

Our proposed solution is simple and entirely contained in [ALCF / examples / checkpoint_conversion / hf_to_zero.py](/ALCF/examples/checkpoint_conversion/hf_to_zero.py).

Explicitly:

1. Create the HF model as normal
2. Pass it to `deepspeed.initialize(...)` to create the `DeepSpeedEngine`
3. Call `DeepSpeedEngine.save_checkpoint(...)` to save the checkpoint (sketched below)
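
A minimal sketch of these three steps (the model name and ZeRO settings here are illustrative; the full `hf_to_zero.py` script below adds CLI flags, `zero.Init()` for stage 3, and bucket-size tuning):

```python
import deepspeed
from transformers import AutoModelForCausalLM

# 1. Create the HF model as normal
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')

# 2. Wrap it in a DeepSpeedEngine with a minimal ZeRO-3 config
ds_config = {
    'train_batch_size': 1,
    'bf16': {'enabled': True},
    'optimizer': {'type': 'Adam'},
    'zero_optimization': {'stage': 3},
}
engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)

# 3. Save the sharded ZeRO checkpoint
engine.save_checkpoint('zero-checkpoints/Llama-3.2-1B-Instruct-zs3')
```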


To run:

```bash
launch python3 \
    ALCF/examples/checkpoint_conversion/hf_to_zero.py \
    --zero-stage=3 \
    --device=cpu \
    --model='meta-llama/Llama-3.3-70B-Instruct'
```

> [!WARNING]
> I believe this approach is still not finished because I expect there will be
> naming mismatches between the layers of the HF model (now saved in our ZeRO
> checkpoint) and what our MDS model expects.
>
> This requires further testing to confirm, but we are now able to successfully
> convert the 70B model to a ZeRO checkpoint.
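
One way to probe the suspected mismatch is to diff the parameter names of the two `state_dict`s (a sketch; `hf_model` and `mds_model` are hypothetical handles, constructed as in `tools/hf2megads_weight_converter.py`):

```python
# Compare parameter names between the HF and MDS models
hf_keys = set(hf_model.state_dict().keys())
mds_keys = set(mds_model.state_dict().keys())
print('only in HF: ', sorted(hf_keys - mds_keys)[:10])
print('only in MDS:', sorted(mds_keys - hf_keys)[:10])
```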

## Estimate Memory Needs for Llama-3.3-70B-Instruct

DeepSpeed provides a convenient mechanism for estimating the memory needs of a model.

Below is the summary for the Llama-3.3-70B-Instruct model of interest.

| Model Name | Model Size | Model Parameters | Largest Layer Parameters | Memory Needed |
|:----------------------:|:----------:|:----------------:|:------------------------:|:-------------:|
| Llama-3.3-70B-Instruct | 70B | 69503M | 1050M | 70.45GB |



```bash
$ python3 -c 'from transformers import AutoModel; \
    from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
    model = AutoModel.from_pretrained("meta-llama/Llama-3.3-70B-Instruct"); \
    estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=12, num_nodes=4)'
```

<details closed><summary>Output</summary>


```bash
Loading checkpoint shards: 100%|████████████████| 30/30 [08:28<00:00, 16.94s/it]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 4 nodes, 12 GPUs per node.
SW: Model with 69503M total params, 1050M largest layer params.
per CPU | per GPU | Options
436.93GB | 3.91GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
4660.54GB | 3.91GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
388.38GB | 6.61GB | offload_param=none, offload_optimizer=cpu , zero_init=1
4660.54GB | 6.61GB | offload_param=none, offload_optimizer=cpu , zero_init=0
70.45GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=1
4660.54GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=0
took: 0h:08m:44s
```

</details>


- Model States and Memory Needs for Llama-3.3-70B-Instruct:


| per CPU | per GPU | Options |
|:---------:|:-------:|:-------------------------------------------------------:|
| 436.93GB | 3.91GB | offload_param=cpu, offload_optimizer=cpu, zero_init=1 |
| 4660.54GB | 3.91GB | offload_param=cpu, offload_optimizer=cpu, zero_init=0 |
| 388.38GB | 6.61GB | offload_param=none, offload_optimizer=cpu, zero_init=1 |
| 4660.54GB | 6.61GB | offload_param=none, offload_optimizer=cpu, zero_init=0 |
| 70.45GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=1 |
| 4660.54GB | 28.19GB | offload_param=none, offload_optimizer=none, zero_init=0 |
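
The `zero_init=1` rows assume the model is constructed under `deepspeed.zero.Init()`, which allocates parameters as shards across ranks from the start instead of replicating the full model in every CPU process; this is what `hf_to_zero.py` does for stage 3. A sketch, mirroring the script:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Parameters are partitioned across ranks at construction time (zero_init=1),
# avoiding a full per-process copy of the 70B model in host memory
with deepspeed.zero.Init():
    model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.3-70B-Instruct')
```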



174 changes: 174 additions & 0 deletions ALCF/examples/checkpoint_conversion/hf_to_zero.py
@@ -0,0 +1,174 @@
from argparse import Namespace
import os
from pathlib import Path
from typing import Optional

import ezpz
import torch
import torch.distributed
import deepspeed

from transformers import AutoModelForCausalLM

logger = ezpz.get_logger(__name__)


def parse_args():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model', type=str, default='meta-llama/Llama-3.2-1B-Instruct'
    )
    parser.add_argument('--device', type=str, default=None, required=False)
    parser.add_argument('--train-batch-size', type=int, default=1)
    parser.add_argument('--zero-stage', type=int, default=3)
    # add arg for output directory
    parser.add_argument('--output-dir', type=str, default='.')
    parser.add_argument('--kv-offload', action='store_true')
    parser.add_argument('--async-kv-offload', action='store_true')
    parser.add_argument('--gen-len', type=int, default=1024)
    parser.add_argument('--strict', action='store_true')
    return parser.parse_args()


def meta_to_cpu(container, dtype=None):
    if isinstance(container, torch.Tensor):
        return torch.empty(*container.shape, dtype=dtype or container.dtype)
    elif isinstance(container, tuple):
        return tuple(meta_to_cpu(x, dtype) for x in container)
    elif isinstance(container, dict):
        return dict((k, meta_to_cpu(v, dtype)) for k, v in container.items())
    else:
        raise ValueError(f'Invalid type: {container}')


def get_model(
    model_name: str = 'meta-llama/Llama-3.2-1B-Instruct',
    dummy: Optional[bool] = None,
    ignore_mismatched_sizes: bool = True,
) -> torch.nn.Module:
    if dummy:
        filename = Path('.').joinpath(
            f'{model_name.replace("/", "-")}-hf-weights'
        )
        if not filename.exists():
            from accelerate import init_empty_weights

            logger.info('Creating dummy weights')
            with init_empty_weights():
                model = AutoModelForCausalLM.from_pretrained(
                    f'{model_name}',
                    ignore_mismatched_sizes=ignore_mismatched_sizes,
                )
            model.save_pretrained(
                filename,
                state_dict=meta_to_cpu(model.state_dict(), torch.float16),
            )
            return model

    model = AutoModelForCausalLM.from_pretrained(
        f'{model_name}',
        ignore_mismatched_sizes=ignore_mismatched_sizes,
    )
    return model


def get_ds_config(
    micro_batch_size: int = 1,
    gradient_accumulation_steps: int = 2,
    zero_stage: int = 3,
    hidden_size: Optional[int] = None,
) -> dict:
    train_batch_size = (
        micro_batch_size * ezpz.get_world_size() * gradient_accumulation_steps
    )
    zero_config = {
        'stage': zero_stage,
    }
    if zero_stage == 3:
        if hidden_size is not None:
            zero_config |= {
                'stage3_prefetch_bucket_size': 2 * hidden_size * hidden_size,
                'stage3_param_persistence_threshold': hidden_size,
                'stage3_max_live_parameters': 2 * hidden_size * hidden_size,
            }
        zero_config |= {
            'offload_optimizer': {
                'device': 'cpu',
            },
            'offload_param': {
                'device': 'cpu',
            },
        }

    return {
        'bf16': {'enabled': True},
        'fp16': {'enabled': False},
        'gradient_accumulation_steps': gradient_accumulation_steps,
        'optimizer': {
            'type': 'Adam',
        },
        'steps_per_print': 1,
        'train_batch_size': train_batch_size,
        'train_micro_batch_size_per_gpu': 1,
        'wall_clock_breakdown': True,
        'zero_optimization': zero_config,
    }


def convert_checkpoint(args: Namespace):
    if args.device is not None and args.device == 'cpu':
        os.environ['TORCH_DEVICE'] = 'cpu'
        os.environ['DS_ACCELERATOR'] = 'cpu'

    if args.zero_stage == 3:
        # Allocate parameters as ZeRO-3 shards at construction time, so the
        # full model is never materialized on a single host
        cm = deepspeed.zero.Init()
    else:
        from contextlib import nullcontext

        cm = nullcontext()

    with cm:
        with torch.no_grad():
            model = get_model(
                args.model, ignore_mismatched_sizes=not args.strict
            )

    assert isinstance(model, torch.nn.Module)
    if args.kv_offload:
        model.set_kv_cache_offload(
            True,
            gen_len=args.gen_len,
            async_kv_offload=args.async_kv_offload,
        )

    logger.info(f'model:\n{model}')
    logger.info(f'{model.config=}')
    ds_config = get_ds_config(
        micro_batch_size=args.train_batch_size,
        zero_stage=args.zero_stage,
        hidden_size=model.config.hidden_size,
    )
    output_dir = Path('zero-checkpoints').joinpath(
        f'{args.model}-zs{args.zero_stage}-mb{args.train_batch_size}'
    )

    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
    ds_engine.module.eval()
    model = ds_engine.module
    logger.info(f'Saving ZeRO checkpoint to {output_dir}')

    ds_engine.save_checkpoint(output_dir)

    torch.distributed.barrier()


def main():
    _ = ezpz.setup_torch(backend='DDP')
    args = parse_args()
    convert_checkpoint(args)


if __name__ == '__main__':
    main()
73 changes: 73 additions & 0 deletions ALCF/examples/finetune_llama3/README.md
@@ -0,0 +1,73 @@
# Finetune Llama3 from Hugging Face Checkpoint

1. **Clone + navigate into repo**:

```bash
git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
```

1. **Setup environment**:

```bash
PBS_O_WORKDIR=$(pwd) source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
ezpz_setup_env
```

1. **Install Dependencies**:

```bash
python3 -m pip install deepspeed --require-virtualenv
python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
python3 -m pip install -e .
```

1. **Download data**:

```bash
curl https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json -o dataset/alpaca_data.json
```

(from [here](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json))

1. **Download HF Checkpoint**:

```bash
MODEL="Llama-3.2-1B"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download "meta-llama/${MODEL}" --local-dir "${MODEL}"
```

- _might_ require updating `huggingface_hub` and `hf_transfer`:

```bash
python3 -m pip install --upgrade "huggingface_hub[hf_transfer,cli]" hf_transfer
```

1. **Convert HF --> MDS**:

```bash
TP=1 PP=1 ZERO_STAGE=1 MODEL_NAME=Llama-3.2-1B bash ALCF/examples/finetune_llama3/finetune_llama.sh convert_hf2mds
```

<details closed><summary>Old:</summary>

From original README:

### Usage

#### 1. Converting Hugging Face Model Weights to Megatron-Deepspeed Model

```bash
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds
```

This command writes the Hugging Face model weights into the Megatron-DeepSpeed model and saves the result. You can adjust the parallel configuration in the script. `convert_mds2hf` converts a Megatron-DeepSpeed model back into the Hugging Face format.

#### 2. Fine-tuning Process

```bash
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh
```

Execute this command to initiate the finetuning process. The task originates from [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca.git).

</details>
18 changes: 18 additions & 0 deletions ALCF/examples/finetune_llama3/ds_config.json
@@ -0,0 +1,18 @@
{
  "train_batch_size": 6,
  "train_micro_batch_size_per_gpu": 1,
  "steps_per_print": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4
    }
  },
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {
    "enabled": true
  }
}
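
Note that DeepSpeed requires `train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * (data-parallel world size)`, so this config assumes 6 ranks. A quick sanity check (a sketch; the path assumes you run from the repo root):

```python
import json

# Verify the batch-size identity that DeepSpeed enforces at initialization
with open('ALCF/examples/finetune_llama3/ds_config.json') as f:
    cfg = json.load(f)

world_size = 6  # assumed number of data-parallel ranks for this config
assert cfg['train_batch_size'] == (
    cfg['train_micro_batch_size_per_gpu']
    * cfg['gradient_accumulation_steps']
    * world_size
)
```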