@saforem2
Owner

Copilot Summary

This pull request introduces a new workflow for converting large HuggingFace models to DeepSpeed ZeRO checkpoints, updates documentation for both conversion and finetuning processes, and adds supporting scripts and configuration files for Llama3 models. The changes aim to streamline model conversion and finetuning on ALCF systems, improve reproducibility, and provide clear instructions and configs for users.

Model Conversion and Finetuning Workflow

  • Added hf_to_zero.py script to automate conversion of HuggingFace models to DeepSpeed ZeRO checkpoints, including argument parsing, device selection, and configuration generation.
  • Introduced a new README (checkpoint_conversion/README.md) detailing the conversion process, memory estimation, and usage instructions for large models.
  • Updated the finetuning workflow for Llama3 models, including a new README (finetune_llama3/README.md), a comprehensive shell script (finetune_llama.sh), and DeepSpeed configuration files (ds_config.json, ds_config_empty.json).
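
To make the memory estimation and configuration generation steps above more concrete, here is a minimal sketch. It is not taken from hf_to_zero.py or the READMEs in this pull request: the function names (zero3_model_state_gib, make_ds_config) and all values are hypothetical, and the 16 bytes per parameter figure is the standard accounting for mixed-precision Adam training, which may differ from the estimate the README uses. The config keys shown are ordinary DeepSpeed options, but the actual contents of ds_config.json are not reproduced here.

```python
import json

# Assumption: mixed-precision Adam training, using the usual ZeRO accounting:
# 2 bytes (16-bit weights) + 2 bytes (16-bit gradients)
# + 12 bytes (fp32 weights, momentum, and variance) per parameter.
BYTES_PER_PARAM = 2 + 2 + 12


def zero3_model_state_gib(num_params: int, world_size: int) -> float:
    """Approximate per-device model-state memory in GiB under ZeRO stage 3.

    Stage 3 partitions weights, gradients, and optimizer states across all
    ranks, so the total is divided by world_size. Activations, buffers, and
    fragmentation are NOT included in this estimate.
    """
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    return num_params * BYTES_PER_PARAM / world_size / 2**30


def make_ds_config(stage: int = 3, micro_batch_size: int = 1) -> dict:
    """Build a minimal DeepSpeed config dict (illustrative values only)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": stage},
    }


if __name__ == "__main__":
    # Hypothetical example: an 8B-parameter model sharded over 12 devices.
    estimate = zero3_model_state_gib(8_000_000_000, 12)
    print(f"{estimate:.1f} GiB of model state per device")
    print(json.dumps(make_ds_config(), indent=2))
```

The estimate covers model states only, so it is best read as a lower bound when deciding how many devices a conversion or finetuning run needs.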

Dataset and Notes Updates

  • Changed dataset paths in sunspot/books.txt to reference new locations, improving data management for training.
  • Added notes documenting AuroraGPT-3B organization, configs, and experimental plans for future reference and reproducibility.

saforem2 and others added 30 commits December 29, 2024 11:28
saforem2 and others added 29 commits July 15, 2025 12:42
@saforem2 saforem2 merged commit d2c6684 into saforem2:main Sep 17, 2025

2 participants