diff --git a/README.md b/README.md index aec6755..21064d3 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,53 @@ -# MoE Recipes
+moe-recipes
+===========================
+
+User-friendly tool for seamless continual pre-training of Mixture of Experts (MoE) models
+
+ +moe-recipes is a tool designed to make continual pre-training of Large Language Models (LLMs) with a Mixture of Experts (MoE) architecture easy and efficient. With an intuitive interface and flexible configuration options, researchers and developers can manage training on supported MoE models and datasets with minimal effort. The tool supports distributed training on large GPU clusters using DeepSpeed as its backend and offers extensive customization, enabling users to leverage cutting-edge techniques with ease. + +What sets moe-recipes apart is its seamless integration with Hugging Face Transformers, allowing you to continue pre-training or perform instruction tuning on MoE models with minimal changes. There is no need to convert checkpoints or deal with complex workflows, so you can focus on refining your model. + +| Feature | moe-recipes | llm-recipes | +|---------------------------------|-------------|---------------| +| **MoE Support** | ✅ | ❌ | +| **Dense LLM Support** | ❌ | ✅ | +| **Continual Pre-Training** | ✅ | ✅ | +| **Multi-Node Support** | ✅ | ✅ | # Table of Contents -1. [Installation](#installation) +- [Installation](#installation) + - [Multi-node Support](#multi-node-support) + - [FlashAttention](#flashattention) +- [Usage](#usage) + - [MoE Instruction Tuning](#moe-instruction-tuning) + - [MoE Continual Pre-Training](#moe-continual-pre-training) +- [Checkpoint formats](#checkpoint-formats) + - [DeepSpeed format to Hugging Face format](#deepspeed-format-to-hugging-face-format) +- [Inference](#inference) +- [Training Speed and Scalability](#training-speed-and-scalability) +- [Projects Using moe-recipes](#projects-using-moe-recipes) +- [Citation](#citation) ## Installation -To install the package, run the following command: +This package has been tested with Python 3.10 and 3.11. CUDA Toolkit 12.1 is the recommended environment. + +To install the required packages, run: ```bash pip install -r requirements.txt ``` +> Note: `requirements.txt` assumes that CUDA Toolkit 12.1 is installed on your system. + +### Multi-node Support -If you want to use the library in multi-nodes, you need to install the below packages: +For multi-node support, ensure you have the following dependencies installed: ```bash module load openmpi/4.x.x @@ -22,9 +57,176 @@ pip install mpi4py ### FlashAttention -To install the FlashAttention, run the following command: (GPU is required) +For GPU-accelerated FlashAttention, follow these steps (a GPU is required to build flash-attn): ```bash pip install ninja packaging wheel pip install flash-attn --no-build-isolation ``` + +## Usage + +### MoE Instruction Tuning + +Instruction tuning for MoE models is supported experimentally. +It has not been fully tested yet, so please use it with care. + +#### 1. **Data Preparation** + +Prepare your data in the following format and save it as a JSONL file: + +```jsonl +{ + "input": [ + { + "role": "user", + "content": "What is the weather like today?" + } + ], + "output": { + "role": "assistant", + "content": "The weather is sunny with a high of 25 degrees." + } +} +``` + +#### 2. **Change Dataset Class** + +If necessary, modify the `Dataset` class in `src/llama_recipes/utils/instruction_tuning.py` so that examples match the model's expected input format. +However, most models ship with chat templates, so you may not need to change the `Dataset` class at all.
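+
+For reference, the bundled `Dataset` class builds each training example with the tokenizer's chat template, roughly as sketched below. This is a simplified illustration of the logic in `src/llama_recipes/utils/instruction_tuning.py`; the tokenizer name, record, and system prompt are placeholders.
+
+```python
+import torch
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-57B-A14B")  # placeholder tokenizer
+
+# One record in the JSONL format shown above.
+record = {
+    "input": [{"role": "user", "content": "What is the weather like today?"}],
+    "output": {"role": "assistant", "content": "The weather is sunny with a high of 25 degrees."},
+}
+system = [{"role": "system", "content": "You are a helpful assistant."}]  # see the --system-prompt-* options
+
+# Prompt tokens only, used to mask out the prompt when computing the loss.
+prompt_ids = tokenizer.apply_chat_template(system + record["input"], add_generation_prompt=True, tokenize=True)
+# Full example: prompt plus the assistant answer.
+example_ids = tokenizer.apply_chat_template(system + record["input"] + [record["output"]], tokenize=True)
+
+input_ids = torch.tensor(example_ids, dtype=torch.int64)
+labels = input_ids.clone()
+labels[: len(prompt_ids)] = -100  # ignored by the loss (the actual code uses its own IGNORE_INDEX)
+```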
+ +#### 3. **Indexing** + +To load the dataset efficiently, create an index file using the following command: + +```bash +python tools/pre-process/index_dataset.py \ + --data-file-path <path-to-jsonl-file> +``` + +After indexing, an `.index_cache` directory will be created in the same directory as the JSONL file. + +#### 4. **Training** + +We do not yet provide a script for instruction tuning, but we plan to add one in the future. + +### MoE Continual Pre-Training + +#### 1. **Data Preparation** + +Prepare your data in the following format and save it as a JSONL file: + +```jsonl +{ + "text": "What is the weather like today?\nThe weather is sunny with a high of 25 degrees." +} +``` + +#### 2. **Tokenize Data** + +Tokenize your data using the tokenizer provided by the model you are using. +For example, to tokenize data for Qwen2-57B-A14B, run the following command: + +```bash +DATASET_DIR=/path/to/datasets/ +OUTPUT_DIR=/path/to/output/ + +mkdir -p $OUTPUT_DIR + +python megatron_lm/tools/preprocess_data.py \ + --input ${DATASET_DIR}/wiki-base.jsonl \ + --output-prefix ${OUTPUT_DIR}/ja_wiki \ + --tokenizer-type Qwen2Tokenizer \ + --tokenizer-model /path/to/hf-checkpoints/Qwen2-57B-A14B/tokenizer.json \ + --append-eod \ + --workers 64 +``` + +#### 3. **Training** + +We currently support Mixtral, Qwen2-MoE, and DeepSeek-MoE. +If you want to continually pre-train or instruction-tune other models, modify `src/llama_recipes/get_models.py` and `src/llama_recipes/get_model_decoder_layer.py`. + +We provide an example script for continual pre-training of Mixtral-8x7B in `scripts/tsubame/Mixtral-8x7B-VE/mixtral-8x7b.sh`. +You can modify the script to suit your needs. + +## Checkpoint formats + +### DeepSpeed format to Hugging Face format + +You can convert DeepSpeed checkpoints to Hugging Face format in two stages: first, convert the checkpoint to PyTorch format, and then convert the PyTorch checkpoint to Hugging Face format. + +#### 1. **Convert DeepSpeed checkpoint to PyTorch format** + +```bash +ITERATION=2000 +FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION) + +CHECK_POINT_DIR=/path/to/checkpoints/Mixtral-8x7b/${FORMATTED_ITERATION} + +python tools/checkpoint-convert/zero_to_fp32.py \ + --checkpoint-dir $CHECK_POINT_DIR \ + --output-file $CHECK_POINT_DIR/model.pt \ + --debug +``` + +#### 2. **Convert PyTorch checkpoint to Hugging Face format** + +```bash +ITERATION=2000 +FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION) + +CHECK_POINT_PATH=/path/to/checkpoints/Mixtral-8x7b/${FORMATTED_ITERATION}/model.pt +OUTPUT_PATH=/path/to/Mixtral-8x7b/${FORMATTED_ITERATION} + +echo "convert ${CHECK_POINT_PATH} to ${OUTPUT_PATH}" + +mkdir -p $OUTPUT_PATH + +BASE_MODEL_CHECKPOINT=/path/to/Mixtral-8x7B-v0.1 + +python tools/checkpoint-convert/convert_ckpt.py \ + --model $BASE_MODEL_CHECKPOINT \ + --ckpt $CHECK_POINT_PATH \ + --out $OUTPUT_PATH \ + --sequence-length 8192 +``` + +## Inference + +After checkpoint conversion, you can use the Hugging Face Transformers library to load the converted checkpoint and perform inference. + +The following is an example of running inference with the converted checkpoint (Hugging Face format): + +```bash +python tools/inference/inference-mixtral.py \ + --model-path /path/to/converted/iter_0004000 \ + --tokenizer-path /path/to/tokenizer \ + --prompt "Tokyo is the capital of" +```
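+
+Because the converted checkpoint is in the standard Hugging Face layout, you can also load it directly in Python. The snippet below is a minimal sketch; the checkpoint and tokenizer paths, dtype, and generation settings are placeholders rather than part of the repository:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+checkpoint = "/path/to/converted/iter_0004000"  # output of the conversion step above
+tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")  # e.g. the base Mixtral-8x7B-v0.1 tokenizer
+model = AutoModelForCausalLM.from_pretrained(
+    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
+)
+
+inputs = tokenizer("Tokyo is the capital of", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```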
+ +## Training Speed and Scalability + +We are currently working on improving the training speed and scalability of moe-recipes. +We will update this section with more information soon. + +## Projects Using moe-recipes + +Below are some of the projects where we have directly used moe-recipes: + +- [Building a Large Japanese Web Corpus for Large Language Models](https://arxiv.org/abs/2404.17733) + +## Citation + +We are currently submitting the paper to an SC24 workshop, and the citation will be updated soon. + +```bibtex +@software{fujii_moe-recipes_2024, +author = {Kazuki Fujii and Taishi Nakamura and Rio Yokota}, +month = {March}, +title = {{moe-recipes}}, +url = {https://github.com/rioyokotalab/moe-recipes}, +version = {1.0.0}, +year = {2024} +} +``` diff --git a/images/moe-recipes-logo.webp b/images/moe-recipes-logo.webp new file mode 100644 index 0000000..0d70267 Binary files /dev/null and b/images/moe-recipes-logo.webp differ diff --git a/requirements.txt b/requirements.txt index 1d19677..de89e2a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -11,11 +11,10 @@ peft appdirs loralib scipy -py7zr # 圧縮解凍library +py7zr bitsandbytes -fire # argparser +fire -# formatter & linter black flake8 diff --git a/src/llama_recipes/arguments.py b/src/llama_recipes/arguments.py index c6411ce..e94f53e 100644 --- a/src/llama_recipes/arguments.py +++ b/src/llama_recipes/arguments.py @@ -166,6 +166,13 @@ def _add_data_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser: help='Number of additional vocabulary tokens. They are used for span masking in the T5 model' ) group.add_argument("--dataset-cyclic", action="store_true") + # instruction tuning + group.add_argument( '--system-prompt-role', type=str, default="system" ) + group.add_argument( '--system-prompt-content', type=str, default='あなたは誠実で優秀な日本人のアシスタントです。' ) return parser diff --git a/src/llama_recipes/inference/__init__.py b/src/llama_recipes/inference/__init__.py deleted file mode 100644 index 54ed04d..0000000 --- a/src/llama_recipes/inference/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. \ No newline at end of file diff --git a/src/llama_recipes/inference/chat_utils.py b/src/llama_recipes/inference/chat_utils.py deleted file mode 100644 index 8d781e3..0000000 --- a/src/llama_recipes/inference/chat_utils.py +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. - -import json -from typing import List, Literal, TypedDict - - -Role = Literal["user", "assistant"] - - -class Message(TypedDict): - role: Role - content: str - - -Dialog = List[Message] - -B_INST, E_INST = "[INST]", "[/INST]" -B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n" -def format_tokens(dialogs, tokenizer): - prompt_tokens = [] - for dialog in dialogs: - if dialog[0]["role"] == "system": - dialog = [ - { - "role": dialog[1]["role"], - "content": B_SYS - + dialog[0]["content"] - + E_SYS - + dialog[1]["content"], - } - ] + dialog[2:] - assert all([msg["role"] == "user" for msg in dialog[::2]]) and all( - [msg["role"] == "assistant" for msg in dialog[1::2]] - ), ( - "model only supports 'system','user' and 'assistant' roles, " - "starting with user and alternating (u/a/u/a/u...)" - ) - """ - Please verify that your tokenizer support adding "[INST]", "[/INST]" to your inputs. - Here, we are adding it manually. 
- """ - dialog_tokens: List[int] = sum( - [ - tokenizer.encode( - f"{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} ", - ) - for prompt, answer in zip(dialog[::2], dialog[1::2]) - ], - [], - ) - assert ( - dialog[-1]["role"] == "user" - ), f"Last message must be from user, got {dialog[-1]['role']}" - dialog_tokens += tokenizer.encode( - f"{B_INST} {(dialog[-1]['content']).strip()} {E_INST}", - ) - prompt_tokens.append(dialog_tokens) - return prompt_tokens - - -def read_dialogs_from_file(file_path): - with open(file_path, 'r') as file: - dialogs = json.load(file) - return dialogs \ No newline at end of file diff --git a/src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py b/src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py deleted file mode 100644 index 175a97c..0000000 --- a/src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. - -# from accelerate import init_empty_weights, load_checkpoint_and_dispatch - -import fire -import os -import sys -import yaml - -from transformers import LlamaTokenizer - -from llama_recipes.inference.model_utils import load_llama_from_config - -# Get the current file's directory -current_directory = os.path.dirname(os.path.abspath(__file__)) - -# Get the parent directory -parent_directory = os.path.dirname(current_directory) - -# Append the parent directory to sys.path -sys.path.append(parent_directory) -from model_checkpointing import load_sharded_model_single_gpu - -def main( - fsdp_checkpoint_path="", # Path to FSDP Sharded model checkpoints - consolidated_model_path="", # Path to save the HF converted model checkpoints - HF_model_path_or_name="" # Path/ name of the HF model that include config.json and tokenizer_config.json (e.g. 
meta-llama/Llama-2-7b-chat-hf) - ): - - try: - file_name = 'train_params.yaml' - # Combine the directory and file name to create the full path - train_params_path = os.path.join(fsdp_checkpoint_path, file_name) - # Open the file - with open(train_params_path, 'r') as file: - # Load the YAML data - data = yaml.safe_load(file) - - # Access the 'model_name' field - HF_model_path_or_name = data.get('model_name') - - print(f"Model name: {HF_model_path_or_name}") - except FileNotFoundError: - print(f"The file {train_params_path} does not exist.") - HF_model_path_or_name = input("Please enter the model name: ") - print(f"Model name: {HF_model_path_or_name}") - except Exception as e: - print(f"An error occurred: {e}") - - - #load the HF model definition from config - model_def = load_llama_from_config(HF_model_path_or_name) - print("model is loaded from config") - #load the FSDP sharded checkpoints into the model - model = load_sharded_model_single_gpu(model_def, fsdp_checkpoint_path) - print("model is loaded from FSDP checkpoints") - #loading the tokenizer form the model_path - tokenizer = LlamaTokenizer.from_pretrained(HF_model_path_or_name) - tokenizer.save_pretrained(consolidated_model_path) - #save the FSDP sharded checkpoints in HF format - model.save_pretrained(consolidated_model_path) - print(f"HuggingFace model checkpoints has been saved in {consolidated_model_path}") -if __name__ == "__main__": - fire.Fire(main) diff --git a/src/llama_recipes/inference/model_utils.py b/src/llama_recipes/inference/model_utils.py deleted file mode 100644 index 02785e9..0000000 --- a/src/llama_recipes/inference/model_utils.py +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the GNU General Public License version 3. - -from peft import PeftModel -from transformers import LlamaForCausalLM, LlamaConfig - -# Function to load the main model for text generation -def load_model(model_name, quantization): - model = LlamaForCausalLM.from_pretrained( - model_name, - return_dict=True, - load_in_8bit=quantization, - device_map="auto", - low_cpu_mem_usage=True, - ) - return model - - -# Function to load the PeftModel for performance optimization -def load_peft_model(model, peft_model): - peft_model = PeftModel.from_pretrained(model, peft_model) - return peft_model - -# Loading the model from config to load FSDP checkpoints into that -def load_llama_from_config(config_path): - model_config = LlamaConfig.from_pretrained(config_path) - model = LlamaForCausalLM(config=model_config) - return model - - \ No newline at end of file diff --git a/src/llama_recipes/inference/safety_utils.py b/src/llama_recipes/inference/safety_utils.py deleted file mode 100644 index 38a44d4..0000000 --- a/src/llama_recipes/inference/safety_utils.py +++ /dev/null @@ -1,169 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. 
- -import os -import torch -import warnings - - -# Class for performing safety checks using AuditNLG library -class AuditNLGSensitiveTopics(object): - def __init__(self): - pass - - def __call__(self, output_text): - try: - from auditnlg.safety.exam import safety_scores - except ImportError as e: - print("Could not import optional dependency: auditnlg\nPlease install manually with:\n pip install auditnlg\nFollowed by:\npip install -r requirements.txt") - raise e - - - data = [{"output": output_text}] - - result = safety_scores(data=data, method="sensitive_topics") - scores = result[1]["all_scores"][0] - is_safe = scores["pred_class"] == "none" - report = "" - if not is_safe: - report += f"Predicted class: {scores['pred_class']}\n" - report += "|" + "|".join(f"{n:^10}" for n in [list(k.keys())[0] for k in scores["class_scores"]]) + "|\n" - report += "|" + "|".join(f"{n:^10.5}" for n in [list(k.values())[0] for k in scores["class_scores"]]) + "|\n" - return "Sensitive Topics", is_safe, report - - -class SalesforceSafetyChecker(object): - def __init__(self): - pass - - def __call__(self, output_text): - from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoConfig - - config = AutoConfig.from_pretrained("Salesforce/safety-flan-t5-base") - tokenizer = AutoTokenizer.from_pretrained("Salesforce/safety-flan-t5-base") - model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/safety-flan-t5-base", config=config) - model.eval() - - prefix = "Is the field safe or unsafe?" - input_ids = tokenizer(prefix + " " + output_text + " ", return_tensors="pt").input_ids - - if len(input_ids[0]) > 512: - warnings.warn( - "Input length is > 512 token. Safety check result could be incorrect." - ) - - with torch.no_grad(): - outputs = model.generate( - input_ids, - output_scores=True, - return_dict_in_generate=True, - max_new_tokens=20, - ) - - is_safe = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True).split(" ")[0] == "safe" - - report = "" - if not is_safe: - true_false_ids = tokenizer("true false").input_ids[:2] - keys = ["toxicity", "hate", "identity", "violence", "physical", "sexual", "profanity", "biased"] - scores = {} - for k, i in zip(keys, range(3,20,2)): - scores[k] = round(outputs.scores[i][0,true_false_ids].softmax(dim=0)[0].item(), 5) - - report += "|" + "|".join(f"{n:^10}" for n in scores.keys()) + "|\n" - report += "|" + "|".join(f"{n:^10}" for n in scores.values()) + "|\n" - return "Salesforce Content Safety Flan T5 Base", is_safe, report - - - def get_total_length(self, data): - prefix = "Is the field safe or unsafe " - input_sample = " {output} ".format(**data[0]) - - return len(self.tokenizer(prefix + input_sample)["input_ids"]) - - -# Class for performing safety checks using Azure Content Safety service -class AzureSaftyChecker(object): - def __init__(self): - try: - from azure.ai.contentsafety import ContentSafetyClient - from azure.core.credentials import AzureKeyCredential - - key = os.environ["CONTENT_SAFETY_KEY"] - endpoint = os.environ["CONTENT_SAFETY_ENDPOINT"] - except ImportError: - raise Exception( - "Could not import required package azure-ai-contentsafety. Install with: pip install azure-ai-contentsafety" - ) - except KeyError: - raise Exception( - "Environment variables not set. Please set CONTENT_SAFETY_KEY and CONTENT_SAFETY_ENDPOINT." 
- ) - - self.client = ContentSafetyClient(endpoint, AzureKeyCredential(key)) - - def __call__(self, output_text): - from azure.core.exceptions import HttpResponseError - from azure.ai.contentsafety.models import AnalyzeTextOptions, TextCategory - - print(len(output_text)) - if len(output_text) > 1000: - raise Exception("Input length to safety check is too long (>1000).") - - categories = [ - TextCategory.VIOLENCE, - TextCategory.SELF_HARM, - TextCategory.SEXUAL, - TextCategory.HATE, - ] - - request = AnalyzeTextOptions(text=output_text, categories=categories) - - try: - response = self.client.analyze_text(request) - except HttpResponseError as e: - print("Analyze text failed.") - if e.error: - print(f"Error code: {e.error.code}") - print(f"Error message: {e.error.message}") - raise - print(e) - raise e - - levels = {0: "Safe", 2: "Low", 4: "Medium", 6: "High"} - - severities = [ - getattr(response, c.name.lower() + "_result").severity for c in categories - ] - - DEFAULT_LEVELS = [0, 0, 0, 0] - - is_safe = all([s <= l for s, l in zip(severities, DEFAULT_LEVELS)]) - - report = "" - if not is_safe: - report = "|" + "|".join(f"{c.name:^10}" for c in categories) + "|\n" - report += "|" + "|".join(f"{levels[s]:^10}" for s in severities) + "|\n" - - return "Azure Content Saftey API", is_safe, report - - -# Function to load the PeftModel for performance optimization -# Function to determine which safety checker to use based on the options selected -def get_safety_checker(enable_azure_content_safety, - enable_sensitive_topics, - enable_salesforce_content_safety, - ): - safety_checker = [] - if enable_azure_content_safety: - safety_checker.append(AzureSaftyChecker()) - if enable_sensitive_topics: - safety_checker.append(AuditNLGSensitiveTopics()) - if enable_salesforce_content_safety: - safety_checker.append(SalesforceSafetyChecker()) - return safety_checker - - - - - diff --git a/src/llama_recipes/utils/instruction_tuning.py b/src/llama_recipes/utils/instruction_tuning.py index a843529..1cb73aa 100644 --- a/src/llama_recipes/utils/instruction_tuning.py +++ b/src/llama_recipes/utils/instruction_tuning.py @@ -4,7 +4,8 @@ import numpy as np import torch -from torch.utils.data import Dataset +from torch.utils.data import Dataset, DataLoader +import torch.distributed as torch_distributed from transformers.tokenization_utils import PreTrainedTokenizer from pathlib import Path from llama_recipes.utils.distributed import print_rank_0 @@ -24,6 +25,10 @@ def __init__( self.max_words: int = args.seq_length self.tokenizer = tokenizer + # system prompt + self.system_prompt_role = args.system_prompt_role + self.system_prompt_content = args.system_prompt_content + # index file dataset_dir = Path(self.data_path).parent index_cache_dir = dataset_dir / ".index_cache" @@ -54,60 +59,65 @@ def __getitem__(self, index: int) -> dict[str, torch.Tensor]: exit(1) try: - conversations: dict[str, str | list[dict[str, str]]] = json.loads(line) + conversations: dict[str, list[dict[str, str]] | str] = json.loads(line) except Exception as e: print(f"index={index}, offset={offset}, line={line}, error={e}") exit(1) - SYSTEM_PROMPT = [ - {"role": "system", "text": "あなたは誠実で優秀な日本人のアシスタントです。"} + SYSTEM_PROMPT: list[dict[str, str]] = [ + { + "role": self.system_prompt_role, + "content": self.system_prompt_content, + } ] # chat template - prompt: str = self.tokenizer.apply_chat_template( + prompt = self.tokenizer.apply_chat_template( conversation=SYSTEM_PROMPT + conversations["input"], # type: ignore - tokenize=False + 
add_generation_prompt=True, + tokenize=True, ) - example: str = prompt + conversations["output"] # type: ignore - encoded_prompt: torch.Tensor = torch.tensor( - self.tokenizer.encode(prompt, add_special_tokens=False), - dtype=torch.int64 - ) - encoded_example: list[int] = self.tokenizer.encode( - example, add_special_tokens=False + example = self.tokenizer.apply_chat_template( + conversation=SYSTEM_PROMPT + conversations["input"] + [ # type: ignore + {"role": "assistant", "content": conversations["output"]} + ], + tokenize=True, ) - encoded_example.append(self.tokenizer.eos_token_id) # type: ignore - encoded_tensor_example: torch.Tensor = torch.tensor(encoded_example, dtype=torch.int64) - - if len(encoded_example) > self.max_words: - print(f"\n\nWARNING: example={example}\n\n") - - padding: int = self.max_words - encoded_tensor_example.shape[0] - if padding > 0: # pad_token_id = 0 (substitute unk_token) - encoded_tensor_example = torch.cat((encoded_tensor_example, torch.zeros(padding, dtype=torch.int64) - 1)) - elif padding < 0: - encoded_tensor_example = encoded_tensor_example[: self.max_words] - - labels = copy.deepcopy(encoded_tensor_example) + tensor_example: torch.Tensor = torch.tensor(example, dtype=torch.int64) + + if len(example) > self.max_words: + print(f"\n\nWARNING: example={self.tokenizer.decode(example)}\n\n") + + padding_length: int = self.max_words - len(example) + eos_token_id: int = self.tokenizer.encode("<|end_of_text|>", add_special_tokens=False)[0] + pad_token_id = eos_token_id + if padding_length > 0: + pad_tensor = torch.full( + (padding_length,), pad_token_id, dtype=torch.int64 + ) + tensor_example = torch.cat((tensor_example, pad_tensor)) + elif padding_length < 0: + tensor_example = tensor_example[: self.max_words] + + labels = copy.deepcopy(tensor_example) # promptの長さ分だけ -1 で埋める -> 損失関数で無視するようになる - labels[: len(encoded_prompt)] = -1 - # 0より大きい(ge)かどうかの真偽値でmaskを作成 - example_mask = encoded_tensor_example.ge(0) + labels[: len(prompt)] = -1 label_mask = labels.ge(0) - if torch.all(label_mask == 0): # len(output) == 0 + if torch.all(label_mask == 0): # 予測部分がない random_index: int = np.random.randint(0, len(self.indexes)) self.__getitem__(random_index) - # ~example_mask -> paddingの部分を 0 で埋める - encoded_tensor_example[~example_mask] = 0 # ~label_mask -> prompt の部分を ignore_index で埋める labels[~label_mask] = IGNORE_INDEX + labels[labels == pad_token_id] = IGNORE_INDEX + # mask out pad token + attention_mask = (tensor_example != pad_token_id).float() return { - "input_ids": encoded_tensor_example, + "input_ids": tensor_example, "labels": labels, - "attention_mask": example_mask.float(), + "attention_mask": attention_mask, } @@ -125,7 +135,7 @@ def get_instruction_tuning_dataloader( tokenizer: PreTrainedTokenizer, data_path: str, train: bool = False, -) -> torch.utils.data.DataLoader: +) -> DataLoader: from llama_recipes.utils.sequence_length_warmup import CustomDistributedSampler from llama_recipes.utils.checkpoint import load_sampler_state_dict @@ -142,8 +152,8 @@ def get_instruction_tuning_dataloader( train_sampler = CustomDistributedSampler( dataset=instruction_dataset, - rank=torch.distributed.get_rank(), - num_replicas=torch.distributed.get_world_size(), + rank=torch_distributed.get_rank(), + num_replicas=torch_distributed.get_world_size(), shuffle=True, seed=args.seed, ) @@ -153,7 +163,7 @@ def get_instruction_tuning_dataloader( set_sampler(sampler=train_sampler) - return torch.utils.data.DataLoader( + return DataLoader( instruction_dataset, 
batch_size=args.micro_batch_size, sampler=train_sampler, diff --git a/tools/inference/inference.py b/tools/inference/inference.py deleted file mode 100644 index 0828a1b..0000000 --- a/tools/inference/inference.py +++ /dev/null @@ -1,39 +0,0 @@ -import argparse - -import torch - -from transformers import AutoTokenizer, MistralForCausalLM - - -parser = argparse.ArgumentParser(description="Generation") -parser.add_argument("--model-path", type=str) -parser.add_argument("--tokenizer-path", type=str) -parser.add_argument("--prompt", type=str, default=None) -args = parser.parse_args() - - -print(f"Loading model {args.model_path}") - -tokenizer = AutoTokenizer.from_pretrained( - pretrained_model_name_or_path=args.tokenizer_path, -) -model = MistralForCausalLM.from_pretrained( - args.model_path, - device_map="auto", torch_dtype=torch.bfloat16 -) - -input_ids: torch.Tensor = tokenizer.encode( # type: ignore - args.prompt, - add_special_tokens=False, - return_tensors="pt" -) -outputs = model.generate( # type: ignore - input_ids.to(device=model.device), # type: ignore - max_new_tokens=128, - temperature=0.99, - top_p=0.95, - do_sample=True, -) - -generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) -print(generated_text) diff --git a/tools/inference/inference.sh b/tools/inference/inference.sh deleted file mode 100644 index 33d8c51..0000000 --- a/tools/inference/inference.sh +++ /dev/null @@ -1,33 +0,0 @@ -#!/bin/bash -#$ -l rt_AG.small=1 -#$ -l h_rt=1:00:00 -#$ -j y -#$ -o outputs/inference/ -#$ -cwd -# module load -source /etc/profile.d/modules.sh -module load cuda/11.8/11.8.0 -module load cudnn/8.9/8.9.2 -module load nccl/2.16/2.16.2-1 -module load hpcx/2.12 - -set -e - -# swich virtual env -source .env/bin/activate - -# distributed settings -export MASTER_ADDR=$(/usr/sbin/ip a show dev bond0 | grep 'inet ' | awk '{ print $2 }' | cut -d "/" -f 1) -export MASTER_PORT=$((10000 + ($JOB_ID % 50000))) - -echo "MASTER_ADDR=${MASTER_ADDR}" - -python tools/inference/inference.py \ - --model-path /bb/llm/gaf51275/llama/converted-hf-checkpoint/mistral-7B-VE/okazaki-cc/iter_0004000 \ - --tokenizer-path /bb/llm/gaf51275/llama/converted-hf-checkpoint/mistral-7B-VE/okazaki-cc/iter_0004000 \ - --prompt "Tokyo is the capital of Japan." - -python tools/inference/inference.py \ - --model-path /bb/llm/gaf51275/llama/converted-hf-checkpoint/mistral-7B-VE/okazaki-cc/iter_0004000 \ - --tokenizer-path /bb/llm/gaf51275/llama/converted-hf-checkpoint/mistral-7B-VE/okazaki-cc/iter_0004000 \ - --prompt "東京工業大学のキャンパスは"