moe-recipes

User-friendly tool for seamless continual pre-training of Mixture of Experts models

moe-recipes is a tool for continual pre-training of Large Language Models (LLMs) that use a Mixture of Experts (MoE) architecture. Its configuration options let researchers and developers run training on supported MoE models with their own datasets. It uses DeepSpeed as its backend for distributed training on large GPU clusters and offers extensive customization.

What sets moe-recipes apart is its integration with Hugging Face Transformers: you can continue pre-training or perform instruction tuning on MoE models with minimal changes. There is no need to convert checkpoints before training or to deal with complex workflows, so you can focus on refining your model.

Feature	moe-recipes	llm-recipes
MoE Support	✅	❌
Dense LLM Support	❌	✅
Continual Pre-Training	✅	✅
Multi-Node Support	✅	✅

2. Tokenize Data

DATASET_DIR=/path/to/datasets/
OUTPUT_DIR=/path/datasets/

mkdir -p $OUTPUT_DIR

python megatron_lm/tools/preprocess_data.py \
  --input ${DATASET_DIR}/wiki-base.jsonl \
  --output-prefix ${OUTPUT_DIR}/ja_wiki \
  --tokenizer-type Qwen2Tokenizer \
  --tokenizer-model /path/to/hf-checkpoints/Qwen2-57B-A14B/tokenizer.json \
  --append-eod \
  --workers 64
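
After tokenization it can help to confirm that the indexed dataset files were actually written before starting a long training job. The following is a minimal sketch, not part of moe-recipes: the helper name `check_tokenized_output` is hypothetical, and the output naming `<prefix>_<json-key>_document.bin` / `.idx` with `text` as the default JSON key is an assumption based on upstream Megatron-LM, so verify it against the bundled megatron_lm/tools/preprocess_data.py.

```shell
#!/bin/sh
# Sketch: confirm that tokenization produced an indexed dataset.
# Assumption: outputs are named <prefix>_<json-key>_document.bin and .idx,
# with "text" as the JSON key, as in upstream Megatron-LM.
check_tokenized_output() {
  # $1: the value passed to --output-prefix
  # $2: JSON key that was tokenized (assumed default: text)
  for ext in bin idx; do
    file="${1}_${2:-text}_document.${ext}"
    if [ ! -s "$file" ]; then
      echo "missing or empty: $file" >&2
      return 1
    fi
  done
  echo "ok: ${1}_${2:-text}_document.bin and .idx exist"
}

# Usage (after preprocess_data.py finishes):
#   check_tokenized_output "${OUTPUT_DIR}/ja_wiki"
```

The check only tests that both files exist and are non-empty; it does not validate their contents.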

3. Training

moe-recipes supports Mixtral, Qwen-2-MoE, and deepseek-moe. To continually pre-train or instruction-tune other models, modify src/llama_recipes/get_models.py and src/llama_recipes/get_model_decoder_layer.py.

We provide an example script for continual pre-training of Mixtral-8x7B in scripts/tsubame/Mixtral-8x7B-VE/mixtral-8x7b.sh. You can modify the script to suit your needs.

Checkpoint formats

DeepSpeed format to Hugging Face format

You can convert DeepSpeed checkpoints to Hugging Face format in two stages: first, convert the checkpoint to PyTorch format, and then convert the PyTorch checkpoint to Hugging Face format.

1. Convert DeepSpeed checkpoint to PyTorch format

ITERATION=2000
FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION)

CHECK_POINT_DIR=/path/Mixtral-8x7b/${FORMATTED_ITERATION}

python tools/checkpoint-convert/zero_to_fp32.py \
  --checkpoint-dir $CHECK_POINT_DIR \
  --output-file $CHECK_POINT_DIR/model.pt \
  --debug
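
The commands above build the checkpoint directory name by zero-padding the iteration number to seven digits. The sketch below isolates that step and adds an early existence check, so a mistyped iteration fails before the converter starts. The helper names and the `CHECKPOINT_ROOT` variable are hypothetical and are not defined by moe-recipes.

```shell
#!/bin/sh
# Zero-pad an iteration number the same way as printf "iter_%07d" above.
format_iteration() {
  printf 'iter_%07d' "$1"
}

# Resolve the checkpoint directory for an iteration and fail early if absent.
# CHECKPOINT_ROOT is a hypothetical variable, not defined by moe-recipes.
checkpoint_dir_for() {
  dir="${CHECKPOINT_ROOT:-.}/$(format_iteration "$1")"
  if [ ! -d "$dir" ]; then
    echo "checkpoint directory not found: $dir" >&2
    return 1
  fi
  printf '%s\n' "$dir"
}

format_iteration 2000; echo   # prints iter_0002000
```

With `CHECKPOINT_ROOT` set, `CHECK_POINT_DIR=$(checkpoint_dir_for 2000) || exit 1` could replace the hard-coded path.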

2. Convert PyTorch checkpoint to Hugging Face format

ITERATION=2000
FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION)

CHECK_POINT_PATH=/path/to/checkpoints/Mixtral-8x7b/${FORMATTED_ITERATION}/model.pt
OUTPUT_PATH=/path/to/Mixtral-8x7b/${FORMATTED_ITERATION}

echo "convert ${CHECK_POINT_PATH} to ${OUTPUT_PATH}"

mkdir -p $OUTPUT_PATH

BASE_MODEL_CHECKPOINT=/path/to/Mixtral-8x7B-v0.1

python tools/checkpoint-convert/convert_ckpt.py \
  --model $BASE_MODEL_CHECKPOINT \
  --ckpt $CHECK_POINT_PATH \
  --out $OUTPUT_PATH \
  --sequence-length 8192
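
Because the second stage consumes the model.pt file written by the first, the two stages can be chained in one function. The following is a sketch under stated assumptions rather than a script shipped with moe-recipes: the variables `CHECKPOINT_ROOT`, `HF_OUTPUT_ROOT`, and `DRY_RUN` and the helpers `run` and `convert_iteration` are hypothetical, while the tool paths and flags are copied from the commands above.

```shell
#!/bin/sh
# Sketch: run both conversion stages for one iteration.
# Hypothetical names (not from moe-recipes): CHECKPOINT_ROOT, HF_OUTPUT_ROOT,
# DRY_RUN, run, convert_iteration.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

convert_iteration() {
  iter_name=$(printf 'iter_%07d' "$1")
  ckpt_dir="${CHECKPOINT_ROOT}/${iter_name}"
  out_dir="${HF_OUTPUT_ROOT}/${iter_name}"

  # Stage 1: DeepSpeed checkpoint -> single PyTorch file.
  run python tools/checkpoint-convert/zero_to_fp32.py \
    --checkpoint-dir "$ckpt_dir" \
    --output-file "$ckpt_dir/model.pt" \
    --debug || return 1

  # Stage 2: PyTorch file -> Hugging Face format.
  run mkdir -p "$out_dir" || return 1
  run python tools/checkpoint-convert/convert_ckpt.py \
    --model "$BASE_MODEL_CHECKPOINT" \
    --ckpt "$ckpt_dir/model.pt" \
    --out "$out_dir" \
    --sequence-length 8192
}

# Usage: preview the commands for iteration 2000 without running them.
#   DRY_RUN=1 CHECKPOINT_ROOT=/path/to/checkpoints HF_OUTPUT_ROOT=/path/to/hf \
#     BASE_MODEL_CHECKPOINT=/path/to/Mixtral-8x7B-v0.1 convert_iteration 2000
```

Stopping after a failed first stage avoids passing a missing or partial model.pt to the second converter.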

Inference

After checkpoint conversion, you can use the Hugging Face Transformers library to load the converted checkpoint and perform inference.

The following example runs inference with the converted checkpoint (Hugging Face format):

python tools/inference/inference-mixtral.py \
  --model-path /path/to/converted/iter_0004000 \
  --tokenizer-path /path/to/tokenizer/path \
  --prompt "Tokyo is the capital of"

Training Speed and Scalability

We are currently working on improving the training speed and scalability of moe-recipes. We will update this section with more information soon.

Projects Using moe-recipes

Below are some of the projects where we have directly used moe-recipes:

Building a Large Japanese Web Corpus for Large Language Models

Citation

We are currently submitting the paper to an SC24 workshop, and the citation will be updated soon.

@software{fujii_moe-recipes_2024,
  author = {Kazuki Fujii and Taishi Nakamura and Rio Yokota},
  month = {March},
  title = {{moe-recipes}},
  url = {https://github.com/rioyokotalab/moe-recipes},
  version = {1.0.0},
  year = {2024}
}


